Why do we need Linear Activation functions?

The processing at each layer of a fully connected “feed forward” neural net consists of two steps:

  1. You take the dot product of each neuron's weight vector in that layer with the input vector (the outputs of all the neurons in the previous layer), and then you add that neuron's bias value.

  2. You then apply an "elementwise" non-linear activation function to each of the per-neuron values computed in step 1.

Step 1 is referred to as the “linear activation” and step 2 is (obviously) the non-linear activation.
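The two steps can be sketched in a few lines of NumPy. The function name, shapes, and choice of ReLU here are just for illustration, not from any particular library:

```python
import numpy as np

def dense_layer(x, W, b, activation):
    """One fully connected layer: linear step, then elementwise non-linearity."""
    z = W @ x + b          # step 1: each neuron's weighted sum of the inputs, plus its bias
    return activation(z)   # step 2: elementwise non-linear activation

rng = np.random.default_rng(0)
x = rng.standard_normal(4)        # outputs of 4 neurons in the previous layer
W = rng.standard_normal((3, 4))   # 3 neurons in this layer, each with 4 weights
b = rng.standard_normal(3)        # one bias per neuron

relu = lambda z: np.maximum(z, 0.0)
y = dense_layer(x, W, b, relu)    # shape (3,): one output per neuron
```

Note that the weight matrix has one row per neuron in this layer and one column per neuron in the previous layer, so the matrix-vector product does step 1 for every neuron at once.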

The reason for step 1 is that it is a very simple and mathematically well-behaved way to allow each neuron to compute an output value based on all of the inputs it is getting from the previous layer. Each neuron in the given layer has its own specific (learned) weight and bias values.

The reason for step 2 is that we must have non-linearity at every layer of the network or there is literally no point in having multiple layers. That is because it's an easily provable theorem that the composition of linear functions is still linear. As you say, it is typical that the non-linear activation functions squash their inputs into a particular range (e.g. sigmoid and tanh), but that is not actually required. There are other commonly used activation functions like ReLU, Leaky ReLU and swish whose ranges are unbounded.
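You can verify the collapse numerically: stack two layers with no non-linearity between them, and a single layer with appropriately combined weights gives the exact same outputs. Shapes here are arbitrary, chosen just for the demo:

```python
import numpy as np

rng = np.random.default_rng(1)
W1, b1 = rng.standard_normal((5, 4)), rng.standard_normal(5)
W2, b2 = rng.standard_normal((3, 5)), rng.standard_normal(3)

x = rng.standard_normal(4)

# Two "layers" with the activation function removed:
two_layers = W2 @ (W1 @ x + b1) + b2

# The equivalent single layer: W = W2 W1, b = W2 b1 + b2
W, b = W2 @ W1, W2 @ b1 + b2
one_layer = W @ x + b

assert np.allclose(two_layers, one_layer)
```

So a 100-layer network without non-linearities has exactly the same expressive power as a 1-layer network.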

The important thing to realize is that the non-linear activation functions are applied "elementwise" as I described above, meaning they take the output value of each neuron individually and compute one output per neuron. So if you didn't have step 1, how could a neuron combine the outputs it is getting from every neuron in the previous layer? That is the key architectural feature of fully connected feed forward networks, right? That's what "fully connected" means. That's the point.
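"Elementwise" is easy to check directly: applying the activation to a whole vector gives the same result as applying it to each entry on its own, i.e. output i depends only on input i. A quick sketch with sigmoid:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-2.0, 0.0, 3.0])            # pre-activations of three neurons
a = sigmoid(z)                            # applied elementwise

# Same as computing each neuron's activation independently:
assert np.allclose(a, [sigmoid(v) for v in z])
```

This is why the activation alone can never mix information across neurons; all the mixing happens in the linear step.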
