Week3 - Choice of Activation function

In Week 3, the professor discusses different activation functions for hidden layers, such as tanh, ReLU and Leaky ReLU. I understand the use of sigmoid in the output layer for binary classification. Otherwise, how do you determine which function to use? Do you try out different options and see what fits best?


Yes, the choice of hidden layer activations is one of the “hyperparameters”, meaning choices that you need to make. As you say, the output layer is fixed: sigmoid for binary classification and softmax for multiclass classification (we haven’t learned about softmax yet, but we will in Course 2). But for the hidden layers, you have quite a few choices.

What you will see in this and the subsequent courses is that Prof Ng normally uses ReLU for the hidden layer activations, although he uses tanh here in Week 3. You can think of ReLU as the “minimalist” activation function: it’s dirt cheap to compute and provides just the minimum required amount of non-linearity. But it has a limitation as well: its gradient is exactly zero for all z < 0, so a neuron whose input stays negative stops learning. This is the “dead neuron” problem, an extreme form of “vanishing gradient”, so ReLU may not work well in all cases. But it seems to work remarkably well in lots of cases.

So there is a natural order in which to try the possible hidden layer activation functions: start with ReLU. If that doesn’t work well, try Leaky ReLU, which is almost as cheap to compute and eliminates the “dead neuron” problem. With Leaky ReLU you can also try different values of the slope for negative inputs. If that doesn’t work either, try the more expensive functions like tanh, sigmoid, swish or other possibilities.
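To make the “dead neuron” point concrete, here is a minimal sketch in plain Python that compares the derivatives of ReLU, Leaky ReLU and tanh at a few sample values of z. The function names and the default slope of 0.01 are my own choices for illustration, not taken from the course notebooks.

```python
import math

def relu(z):
    # max(0, z): zero for negative inputs, identity otherwise
    return max(0.0, z)

def relu_grad(z):
    # Derivative of ReLU: exactly 0 for z < 0 (we also use 0 at z == 0)
    return 1.0 if z > 0 else 0.0

def leaky_relu(z, slope=0.01):
    # Like ReLU, but negative inputs are scaled by a small slope
    # (slope=0.01 is an assumed default, treat it as a hyperparameter)
    return z if z > 0 else slope * z

def leaky_relu_grad(z, slope=0.01):
    # Derivative is the slope for z <= 0, so it never becomes exactly zero
    return 1.0 if z > 0 else slope

def tanh_grad(z):
    # Derivative of tanh: 1 - tanh(z)^2, needs a tanh evaluation
    return 1.0 - math.tanh(z) ** 2

zs = [-2.0, -0.5, 0.5, 2.0]
for z in zs:
    print(z, relu_grad(z), leaky_relu_grad(z), tanh_grad(z))
```

For the negative values of z, the ReLU derivative is 0 while the Leaky ReLU derivative equals the slope, which is why a Leaky ReLU unit can still receive gradient updates when its input is negative. The tanh derivative is also non-zero there, but it costs more to compute and shrinks as |z| grows.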


Perfect, thank you very much for the detailed explanation.