
> Neural Networks are non-linear because of their activation functions. You need a differentiable function in order to take the derivative so you can back-prop the error, more or less.

To clarify: linear functions would still be differentiable without an activation function. The problem is that the composition of linear functions is just another linear function, so you gain nothing by having multiple layers; you might as well just do some kind of linear regression or classification. Activation functions introduce nonlinearity so that deep learning methods can hopefully learn things linear methods can't.



> Activation functions introduce nonlinearity so that deep learning methods can hopefully learn things linear methods can't.

I would just like to clarify: deep learning methods definitely DO learn things that linear methods can't.

Using linear functions, no matter how many layers, essentially boils down to the "perceptron" architecture. If you Google it, the results will mostly discuss its limitations: for instance, it is famously unable to learn XOR (Google "perceptron XOR"). Essentially the issue is that XOR is not linearly separable, so it can only be expressed as a nonlinear function.
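You can see this failure directly in code. A minimal sketch (using NumPy, fitting a purely linear model w·x + b to the XOR truth table by least squares):

```python
import numpy as np

# XOR truth table
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0.0, 1.0, 1.0, 0.0])

# best linear fit w.x + b, found by least squares
A = np.hstack([X, np.ones((4, 1))])        # append a bias column
w, *_ = np.linalg.lstsq(A, y, rcond=None)
print(A @ w)  # every prediction is 0.5 -- no threshold separates the classes
```

The best a linear model can do is predict 0.5 for all four inputs, so no decision threshold recovers XOR.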

You can give a simple proof of this. Consider what a neural network layer (without an activation function) does, in matrix terms. Take the input as a column vector X (n×1), the layer weights as a matrix W (m×n), and the output as a vector O (m×1).

Then computing the layer output is simply O = WX.

Now we can envision what happens with two layers, W (m×n) and V (k×m). The output of running two layers then becomes:

O = V(WX)

However, matrix multiplication is associative:

O = V(WX) = (VW)X, and VW is just a matrix

So for every 2 layer linear neural network there is a 1 layer neural network that gives the exact same result. So there's no reason to have 2 layers.
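The collapse is easy to check numerically. A sketch with NumPy and random weights:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, k = 4, 5, 3
X = rng.standard_normal((n, 1))   # input, column vector (n x 1)
W = rng.standard_normal((m, n))   # first layer weights
V = rng.standard_normal((k, m))   # second layer weights

two_layers = V @ (W @ X)          # run the layers one after the other
one_layer = (V @ W) @ X           # the single collapsed layer VW
print(np.allclose(two_layers, one_layer))  # True: matmul is associative
```

The single matrix VW computes exactly what the two-layer pass computes.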

A famous related result is that "deep" networks (with activation functions) are universal function approximators (explained here: [1]). Note that "deep" here should be understood in the 1995 sense of deep neural networks, which essentially meant "at least 2 layers" (and usually exactly 2 layers), not the 2010+ sense of networks 10 to 200 layers deep.

[1] http://neuralnetworksanddeeplearning.com/chap4.html


Wait a second, a 2-layer perceptron can solve the XOR problem. The first layer maps it to a linearly separable problem which the second layer solves.

Furthermore, a 2-layer network with non-linear activations can approximate any arbitrary function, so the argument for adding more layers is learnability.

https://en.wikipedia.org/wiki/Universal_approximation_theore...
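A rough illustration of the universal-approximation idea (a sketch with hand-set weights rather than trained ones): pairs of steep sigmoids in one hidden layer form near-indicator "towers" over small intervals, and the output layer weights each tower by the target function's value there, here for the target x² on [0, 1].

```python
import numpy as np

def sigmoid(z):
    # clip to avoid overflow in exp for very steep units
    return 1.0 / (1.0 + np.exp(-np.clip(z, -60.0, 60.0)))

f = lambda x: x * x                   # target function on [0, 1]
k = 1000.0                            # steepness: makes each sigmoid pair a sharp "tower"
edges = np.linspace(0.0, 1.0, 51)     # 50 bins
mids = (edges[:-1] + edges[1:]) / 2

def net(x):
    # hidden layer: a pair of sigmoid units per bin forms an indicator "tower";
    # output layer: each tower is weighted by f at the bin midpoint
    x = np.asarray(x)[..., None]
    towers = sigmoid(k * (x - edges[:-1])) - sigmoid(k * (x - edges[1:]))
    return towers @ f(mids)

xs = np.linspace(0.05, 0.95, 200)
print(np.max(np.abs(net(xs) - f(xs))))  # small approximation error
```

More bins (and steeper units) drive the error down, which is the intuition behind the theorem; actually learning such weights from data is a separate question.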


Given that every perceptron layer does a linear transformation of the input, I have to disagree. There is no linear function that separates XOR, and that also means there is no sequence of linear functions that separates XOR.

That means there is no 1-layer perceptron that can learn XOR, and there is no multilayer perceptron that can learn XOR.


The XOR problem is famously solvable by adding a layer to a single layer perceptron assuming a unit step function. This is a very basic exercise taught in many intro courses.

I agree with every second sentence:

> Given that every perceptron layer does a linear transformation of the input.

True.

> There is no linear function that separates XOR, and that also means there is no sequence of linear functions that separates XOR.

False.

> That means there is no 1-layer perceptron that can learn XOR,

True.

> and there is no multilayer perceptron that can learn XOR.

False.

Please look it up. Here are a few links:

http://toritris.weebly.com/perceptron-5-xor-how--why-neurons...

The graph in slide 3 of this link helps explain it: http://www.di.unito.it/~cancelli/retineu11_12/FNN.pdf

http://www.mind.ilstu.edu/curriculum/artificial_neural_net/x...
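To make the construction concrete, here is one hand-built solution (a sketch; the weights are chosen by hand, not learned): the hidden layer computes OR and AND of the inputs, and the output unit fires for OR-but-not-AND, which is exactly XOR.

```python
def step(z):
    # unit step activation
    return 1.0 if z >= 0 else 0.0

def xor_net(x1, x2):
    h_or  = step(x1 + x2 - 0.5)       # hidden unit 1: OR
    h_and = step(x1 + x2 - 1.5)       # hidden unit 2: AND
    return step(h_or - h_and - 0.5)   # output: OR and not AND == XOR

for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(a, b, int(xor_net(a, b)))   # prints 0, 1, 1, 0
```

The step nonlinearity between the layers is what makes this work; with the identity in its place, the two layers would collapse to one linear map, as discussed above.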


Ah I see. My confusion comes from what is called multilayer perceptrons, which do have activation functions. Presumably that is done exactly because they don't make sense without adding those.

But that makes multilayer perceptrons different from ordinary perceptrons in more than just the multilayer part, which is very confusing.


When I wrote that, I was thinking of the unit step function, for which the derivative is not defined.



