So an activation function is used to map the output of a linear function into some other range. That I learnt in the first course with logistic regression. But why is the activation function so crucial?
I mean, why does having an activation make sense? I could just add sigmoid or another activation function in the output layer, because that is where I want to map the value into a certain range. What "meaning" does adding an activation in the hidden layers give to the next layer?
I hope you understand the question.
What the activation basically does is change a linear function into a non-linear function, i.e. you change its behavior (you dent it, break it, you shape it) in a way that a linear function cannot do on its own, and in a much simpler way than joining polynomials as in SVMs.
If you just add a sigmoid at the end, you only get logistic regression, no matter how many linear layers you have. You only get a dent at the very end.
Right! To state Gent’s point in another way, there is an easily provable theorem that the composition of linear functions is still a linear function. That’s the mathematical way to say that if you feed the output of a linear function into another linear function, the combined result is still a linear function, just with different coefficients. In other words, you don’t get any more complex a function by “stacking” linear layers in a network. You need the non-linear activation function at every layer of the network precisely because the whole point of multiple cascading layers in a Neural Network is that you want to create a more and more complex function. With the addition of non-linearity, the more layers, the more complexity you can get.
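Here is a minimal sketch in NumPy (the layer sizes and random weights are just made-up examples) showing that stacking two linear layers collapses into a single linear layer with different coefficients:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two "linear layers" with made-up sizes: 4 -> 3 -> 2
W1, b1 = rng.normal(size=(3, 4)), rng.normal(size=(3, 1))
W2, b2 = rng.normal(size=(2, 3)), rng.normal(size=(2, 1))

x = rng.normal(size=(4, 1))

# Feed the output of one linear layer into the next...
stacked = W2 @ (W1 @ x + b1) + b2

# ...which is the same as a single linear layer with combined coefficients
W, b = W2 @ W1, W2 @ b1 + b2
single = W @ x + b

print(np.allclose(stacked, single))  # True: stacking linear layers adds no complexity
```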
At the output layer, you specifically need sigmoid as the activation, because it converts the output into something that you can interpret as the probability that the answer is “yes” for a binary classifier. For the hidden layers you have lots of choices. ReLU is one that is very commonly used. You can think of that as the “minimalist” activation function: it’s dirt cheap to compute since it’s just a “high pass filter”. It provides the bare minimum of non-linearity: it’s piecewise linear with a break at z = 0. But it also has the “dead neuron” problem for all inputs < 0 by definition, so it doesn’t always work. If it doesn’t, you can try Leaky ReLU, tanh, swish or sigmoid. Prof Ng will discuss this in more detail as you proceed through the various courses and specializations. You may need to wait until you get to DLS for the full explanation.
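For reference, here is a quick sketch of the activation functions mentioned above in plain NumPy, applied element-wise (the 0.01 slope for Leaky ReLU and beta = 1 for swish are just common defaults, not the only choices):

```python
import numpy as np

def sigmoid(z):
    # Squashes z into (0, 1); used at the output of a binary classifier
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    # Piecewise linear with a break at z = 0; outputs 0 for all negative inputs
    return np.maximum(0.0, z)

def leaky_relu(z, alpha=0.01):
    # Like ReLU but keeps a small slope for z < 0, avoiding "dead" neurons
    return np.where(z > 0, z, alpha * z)

def swish(z, beta=1.0):
    # Smooth ReLU-like curve: z * sigmoid(beta * z)
    return z * sigmoid(beta * z)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(z))        # [0.  0.  0.  0.5 2. ]
print(leaky_relu(z))  # [-0.02  -0.005  0.     0.5    2.   ]
print(sigmoid(z))     # values strictly between 0 and 1
```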
Your explanations are always very detailed and informative, Paul, thank you.
So activation functions are just transformation functions. An output of 0 means that a neuron should not contribute to making decisions in the layers above (in a bottom-to-top layout).
I get it, the whole point of using neural nets over regular machine learning is that they help fit non-linear (complex) data. So as Gent said, we need to dent the linear function to give it a bent shape that matches the shape of the data.