Activation Function Intuition Question

I just want to verify my intuition about why activation functions are necessary. For this example, let's consider a network that classifies digits 0-9. A network WITHOUT an activation function will do well on digits that are similar in size to the digits in the training set, but it will struggle if digits appear darker or lighter, because it is linear and cannot take both size and lightness/darkness into account. A neural network completing the same task but WITH activation functions will be able to take into account both orientation and lightness/darkness, because the weights will learn all possible relationships between the pixel values in the data set, and the sigmoid (or other activation function) will then take digits that are slightly lighter or darker and transform/smooth them so that the network can output the same probability as if they were of the normal darkness/lightness. Does this intuition sound about right, or is it incorrect in some way?


I think you're reading too much into the role of an activation function. But at this point in MLS Course 1, the material hasn't gone much further than simple regression.

Activations don't have much to do with learning about orientation or image contrast. That is more about the role of the hidden layers in a neural network.

A hidden layer must have an activation function that is non-linear (such as sigmoid, tanh, or ReLU). The non-linear function is critical to how an NN learns non-linear combinations of the input features.

At an output layer, most often the activation function used is sigmoid() because it re-scales a floating point number into a range of 0 to 1. This helps if you want to consider an output as a probability.
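
As a quick illustration of that last point, here is a minimal sketch (plain NumPy, not the course code) of how sigmoid() squashes any real-valued input into the interval (0, 1), so it can be read as a probability:

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical raw output-layer values (logits) of very different magnitudes
logits = np.array([-6.0, -1.0, 0.0, 1.0, 6.0])
probs = sigmoid(logits)
print(probs)  # ~[0.0025, 0.269, 0.5, 0.731, 0.9975] -- all strictly between 0 and 1
```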


I understand it is a non-linearity, but maybe let me try to explain what I am thinking in a different way. In a fully connected (not convolutional) network that is detecting digits 0-9, the network has learned, for example, all the relationships between the pixels of a 7 of standard darkness in all sorts of positions. Then at test time a lighter 7 comes along with slightly lighter pixel values. In one of the neurons of the first layer, after the lighter 7's input has been multiplied by the weights and the bias added, the sigmoid transform gives approximately the same value as for the darker 7: say the darker one comes out at 0.994 and the lighter one at 0.992. So the lighter one will get treated the same through all the later layers of the network and get classified correctly.

On the other hand, if you're using a convolutional network, you can use ReLU to focus on relevant connections, and the max pooling layers handle the different brightness/lightness of the images. Does this intuition make sense?
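
To pin down the arithmetic in that example, here is a minimal sketch (the pre-activation values 5.1 and 4.8 are made up, chosen only to reproduce the 0.994 / 0.992 figures above) showing how sigmoid saturation compresses the gap between two nearby inputs:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z_dark, z_light = 5.1, 4.8          # hypothetical pre-activations for the darker and lighter 7
a_dark, a_light = sigmoid(z_dark), sigmoid(z_light)
print(a_dark, a_light)              # ~0.994 vs ~0.992
print(z_dark - z_light, a_dark - a_light)  # input gap of 0.3 shrinks to ~0.002 after sigmoid
```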


@Girijesh @paulinpaloalto Sorry to bother you guys, but is this intuition on the right track?


There are a number of layers of issues here. I don't think we concretely know at an individual neuron level how the network manages to recognize similar shapes but with different colors. If I'm remembering the history right, you've already taken DLS C4 and C5, so you are familiar with the work that Prof Ng describes in the DLS C4 W4 lecture “What are Deep ConvNets Learning?” That's the best explanation I've seen anywhere of the intuitions for what is going on in the internal layers of a neural network. But I have not taken the trouble to see if there is similar work w.r.t. Fully Connected nets.

On the question of non-linearity, it sounds like you've already understood that. The fundamental point is that the composition of linear functions is still linear. That's an easily provable theorem. What that means is what Tom said earlier on this thread: if you don't include the non-linear activation at every level in the network, then there is literally no point in having multiple layers: you can't express a more complex function than a linear function. So every Fully Connected network would be exactly equivalent to Logistic Regression.

Once you make each layer non-linear, then you're in business: composing non-linear functions gets you "more" non-linearity. If you compose non-linear functions, they get really, really non-linear. Of course I'm making a math nerd joke there, but you can actually talk about degrees of non-linearity in polynomial functions. If f(z) = z^3 and g(z) = 2z^2 + 42, then the function h(z) = g(f(z)) is a polynomial of degree 6, right? That is a simple way to see how non-linearity builds in a neural network with the addition of non-linear layers. The more layers, the more complex a function you can learn, with very complex decision boundaries.

(Update: mind you, I'm not saying that neural network layers are necessarily polynomial functions. I'm just giving a mathematical example to show how composition of non-linear functions increases the complexity of the function you can express.)
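
As a concrete (hypothetical, untrained) illustration of the "composition of linear functions is still linear" point, here is a minimal NumPy sketch: two linear layers with no activation collapse into a single equivalent linear layer, while adding a non-linearity between them breaks that equivalence:

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)   # layer 1: 3 inputs -> 4 units
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)   # layer 2: 4 units -> 2 outputs
x = rng.normal(size=3)

# Two linear layers with no activation...
h = W1 @ x + b1
y_two_layers = W2 @ h + b2

# ...are exactly one linear layer with W = W2 W1 and b = W2 b1 + b2
W, b = W2 @ W1, W2 @ b1 + b2
y_one_layer = W @ x + b
print(np.allclose(y_two_layers, y_one_layer))  # True: the extra layer added nothing

# With a non-linear activation between the layers, no single (W, b) reproduces the map
y_nonlinear = W2 @ np.tanh(W1 @ x + b1) + b2
print(np.allclose(y_nonlinear, y_one_layer))   # False (in general)
```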

Then you have the other question of the properties of activation functions beyond their simple non-linearity. ReLU is the "minimalist" activation: it's very cheap to compute and it's non-linear. But it has the "dead neuron" problem for all negative inputs, so it doesn't always work. If it doesn't, then you try Leaky ReLU, which is also cheap and doesn't have the dead neuron problem. If that doesn't work, then you graduate to the more expensive and complex functions like tanh and sigmoid, which are based on exponentials and have different properties (tanh is symmetric about 0, sigmoid is not, and both of them have "flat tails" for large values of |z|). How any of those choices play out at the level of the intuition you are expressing above, I don't claim to know, but maybe we can do a bit of searching and see if anyone has written a paper that applies the DLS C4 W4 lecture techniques to other types of nets (FC or RNN).
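
For reference, here is a small NumPy sketch of the properties mentioned above: tanh is symmetric about 0, sigmoid is not, and both saturate ("flat tails") for large |z|:

```python
import numpy as np

z = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])

sigmoid_vals = 1.0 / (1.0 + np.exp(-z))
tanh_vals = np.tanh(z)

print(sigmoid_vals)  # ~[0.00005, 0.119, 0.5, 0.881, 0.99995] -- in (0, 1), centered at 0.5
print(tanh_vals)     # ~[-1.0, -0.964, 0.0, 0.964, 1.0]       -- in (-1, 1), symmetric about 0
# Both curves are nearly flat once |z| is large: big changes in z barely change the output,
# which is the "flat tails" behavior mentioned above.
```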


Thanks, this clears up a lot of the confusion!


Note that ReLU is just another activation function. It doesn't have anything specific to do with convolutional NNs.

ReLU has advantages and disadvantages:

  • Plus: requires very little computation.
  • Minus: has zero gradient for any negative input, so it isn't very efficient at learning negative weights. This means you need a lot more ReLU units to form a complete solution, and you hope that some of them will cover the negative weights (see the sketch just below this list).
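
Here is a minimal sketch of that second point (plain NumPy, with the gradients written out by hand rather than taken from any framework): ReLU passes back no gradient at all for negative pre-activations, while Leaky ReLU keeps a small slope there:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def relu_grad(z):
    # Derivative of ReLU w.r.t. its input: 1 for z > 0, 0 for z < 0
    return (z > 0).astype(float)

def leaky_relu_grad(z, alpha=0.01):
    # Leaky ReLU keeps a small slope (alpha) on the negative side
    return np.where(z > 0, 1.0, alpha)

z = np.array([-3.0, -0.5, 0.5, 3.0])
print(relu(z))             # [0.  0.  0.5 3. ]
print(relu_grad(z))        # [0. 0. 1. 1.]  -- no gradient signal when z < 0 ("dead" region)
print(leaky_relu_grad(z))  # [0.01 0.01 1.   1.  ]
```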

I don't think the "range compression" you're describing (the limiting of sigmoid()'s output values) is a significant factor in compensating for image intensity differences. That's not why we use sigmoid.

Compensating for dim images might be something another hidden layer would learn to fix - although this is more of a feature normalization or pre-processing task. It’s not fixable by selecting a specific activation function. That’s not what activations do.
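
As a rough sketch of what that kind of pre-processing could look like (a hypothetical per-image rescaling, not anything prescribed by the course): standardizing each image makes a uniformly dimmer copy map to essentially the same network input.

```python
import numpy as np

def normalize_image(pixels, eps=1e-8):
    """Rescale one grayscale image to zero mean and unit variance,
    so a uniformly dimmer or brighter copy maps to (nearly) the same input."""
    pixels = pixels.astype(float)
    return (pixels - pixels.mean()) / (pixels.std() + eps)

rng = np.random.default_rng(1)
img = rng.uniform(0.0, 1.0, size=(28, 28))   # a made-up "normal" image
dim = 0.5 * img                              # the same image at half the brightness
print(np.allclose(normalize_image(img), normalize_image(dim)))  # True
```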


Makes perfect sense, thanks for the help, T!
