I just want to verify my intuition of why activation functions are necessary. For this example let's consider a network that classifies numbers 0-9. A network WITHOUT an activation function will be able to do well on numbers that are similar to the sizes of the numbers in the training set, but it will struggle if numbers appear to be darker or lighter, because it is linear and cannot take both size and lightness/darkness into account. A neural network completing the same task but WITH activation functions will be able to take into account both orientation and lightness/darkness, because the weights will learn all possible relationships between the pixel values in the data set, and the sigmoid (or other activation function) will then take numbers that are slightly lighter or darker and transform/smooth them so that the network can output the same probability as if they were the normal darkness/lightness. Does this intuition sound about right, or is it incorrect in some way?

I think you're reading too much into the role of an activation function. But at this point in MLS Course 1, it hasn't discussed much further than simple regression.

Activations don't have much to do with learning about orientation or image contrast. That is more about the role of using hidden layers in a neural network.

A hidden layer must have an activation function that is non-linear (such as sigmoid, tanh, or ReLU). The non-linear function is critical to how a NN learns non-linear combinations of the input features.
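To make that point concrete, here is a small numpy sketch (the shapes and random weights are just illustrative, not from the course) showing that two stacked linear layers collapse into a single linear layer, while inserting a non-linearity between them blocks that collapse:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two "layers" with no activation: x -> W2 @ (W1 @ x + b1) + b2
W1, b1 = rng.standard_normal((4, 3)), rng.standard_normal(4)
W2, b2 = rng.standard_normal((2, 4)), rng.standard_normal(2)

x = rng.standard_normal(3)
two_linear_layers = W2 @ (W1 @ x + b1) + b2

# The same mapping collapses into ONE linear layer: W = W2 W1, b = W2 b1 + b2
W, b = W2 @ W1, W2 @ b1 + b2
one_linear_layer = W @ x + b

print(np.allclose(two_linear_layers, one_linear_layer))  # True: identical map

# With a non-linearity (ReLU) between the layers, no such collapse exists,
# so the extra layer actually adds expressive power
relu = lambda z: np.maximum(z, 0.0)
with_activation = W2 @ relu(W1 @ x + b1) + b2
```

So without the activation, any number of stacked layers is exactly equivalent to one linear layer.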

At an output layer, most often the activation function used is sigmoid() because it re-scales a floating point number into a range of 0 to 1. This helps if you want to consider an output as a probability.
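A tiny sketch of that rescaling behavior (just illustrating the function itself):

```python
import numpy as np

def sigmoid(z):
    # squashes any real number into the open interval (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(-10.0))  # very close to 0
print(sigmoid(0.0))    # exactly 0.5
print(sigmoid(10.0))   # very close to 1
```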

I understand it is a non-linearity, but maybe let me try to explain what I am thinking in a different way. In a fully connected (not convolutional) network that is detecting numbers 0-9, the network has learned, for example, all the relationships between the pixels of a 7 of standard darkness in all sorts of positions. Then during test time a lighter 7 comes along with slightly lighter pixel values. In one of the neurons in the input layer, after the lighter 7's input has been multiplied by the weights and the bias added, when it is transformed by the sigmoid function it will have approximately the same value: the darker one after passing through sigmoid in neuron one would be 0.994 and the lighter one 0.992, so the lighter one will get treated the same through all later layers of the network and get classified correctly.

While on the other hand, if you're using a convolutional network, you can use ReLU to focus on relevant connections and let the max pooling layers handle the different brightness of the images. Does this intuition make sense?

@Girijesh @paulinpaloalto Sorry to bother you guys, but is this intuition on the right track?

There are a number of layers of issues here. I don't think we concretely know at an individual neuron level how the network manages to recognize similar shapes but with different colors. If I'm remembering the history right, you've already taken DLS C4 and C5, so you are familiar with the work that Prof Ng describes in the DLS C4 W4 lecture "What are Deep ConvNets Learning?" That's the best explanation I've seen anywhere of the intuitions for what is going on in the internal layers of a neural network. But I have not taken the trouble to see if there is similar work w.r.t. Fully Connected nets.

On the question of non-linearity, it sounds like you've already understood that. The fundamental point is that the composition of linear functions is still linear. That's an easily provable theorem. What that means is what Tom said earlier on this thread: if you don't include the non-linear activation at every level in the network, then there is literally no point in having multiple layers: you can't express a more complex function than a linear function. So that would mean that every Fully Connected network would be exactly equivalent to Logistic Regression. Once you make each layer non-linear, then you're in business: composing non-linear functions gets you "more" non-linearity. If you compose non-linear functions then they get really really non-linear. Of course I'm making a math nerd joke there, but you can actually talk about degrees of non-linearity in polynomial functions. If f(z) = z^3 and g(z) = 2z^2 + 42, then the function h(z) = g(f(z)) gives you a polynomial of degree 6, right? So that is a simple way to see the point about how non-linearity builds in a neural network by the addition of non-linear layers. The more layers, the more complex a function you can learn with very complex decision boundaries. *(Update: mind you I'm not saying that neural network layers are necessarily polynomial functions. I'm just giving a mathematical example to show how composition of non-linear functions increases the complexity of the function you can express.)*
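That degree-6 claim is easy to check directly; a tiny Python snippet (just illustrating the math, not a network):

```python
# f(z) = z^3 and g(z) = 2z^2 + 42, as in the example above
def f(z):
    return z ** 3

def g(z):
    return 2 * z ** 2 + 42

def h(z):
    # g(f(z)) = 2(z^3)^2 + 42 = 2z^6 + 42, a degree-6 polynomial
    return g(f(z))

# h agrees with 2*z**6 + 42 at sample points
for z in (-2, 0, 1, 3):
    assert h(z) == 2 * z ** 6 + 42
```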

Then you have the other question of the properties of activation functions beyond their simple non-linearity. ReLU is the "minimalist" activation: it's very cheap to compute and it's non-linear. But it has the "dead neuron" problem for all negative inputs, so it doesn't always work. If it doesn't, then you try Leaky ReLU, which is also cheap and doesn't have the dead neuron problem. If that doesn't work, then you graduate to the more expensive and complex functions like tanh and sigmoid, which are based on exponentials and have different properties (tanh is symmetric about 0, but sigmoid is not, and both of them have "flat tails" for large values of |z|). How any of those choices play out at the level of the intuition you are expressing above, I don't claim to know, but maybe we can do a bit of searching and see if anyone has written a paper that applies the DLS C4 W4 lecture techniques to other types of nets (FC or RNN).
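To illustrate the "dead neuron" point numerically, here is a small numpy sketch comparing the gradients of ReLU and Leaky ReLU (the slope alpha = 0.01 is just a common illustrative choice):

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def relu_grad(z):
    # gradient is 0 for all negative inputs: the "dead neuron" problem
    return (z > 0).astype(float)

def leaky_relu(z, alpha=0.01):
    return np.where(z > 0, z, alpha * z)

def leaky_relu_grad(z, alpha=0.01):
    # small but non-zero slope for negative inputs keeps gradients flowing
    return np.where(z > 0, 1.0, alpha)

z = np.array([-2.0, -0.5, 0.5, 2.0])
print(relu_grad(z))        # zeros for the negative inputs
print(leaky_relu_grad(z))  # alpha (0.01) for the negative inputs
```

A unit stuck in the flat region of plain ReLU gets no gradient signal at all, which is why Leaky ReLU can help.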

Thanks, this clears a lot of the confusion up!

Note that ReLU is just another activation function. It doesn't have anything specific to do with convolutional NNs.

ReLU has advantages and disadvantages:

- Plus: requires very little computation.
- Minus: has zero gradient for any negative input, so it isn't very efficient at learning negative weights. This means you need a lot more ReLU units to form a complete solution, and hope that some of them will cover the negative weights.

I don't think the "range compression" you're describing (the limited output range of sigmoid()) is a significant factor in compensating for image intensity differences. That's not why we use sigmoid.

Compensating for dim images might be something another hidden layer would learn to fix, although this is more of a feature normalization or pre-processing task. It's not fixable by selecting a specific activation function. That's not what activations do.

Makes perfect sense, thanks for the help T!