Week3 - Choice of Activation function

paulinpaloalto · February 5, 2022, 2:06am

Yes, the choice of hidden layer activations is one of the “hyperparameters”, meaning choices that you need to make. As you say, the output layer is fixed: sigmoid for binary classification and softmax for multiclass classification (we haven’t learned about softmax yet, but we will in Course 2). But for the hidden layers, you have quite a few choices. What you will see in this and the subsequent courses is that Prof Ng normally uses ReLU for the hidden layer activations, although he uses tanh here in Week 3. You can think of ReLU as the “minimalist” activation function: it’s dirt cheap to compute and provides just the minimum required amount of non-linearity. But it has some limitations as well: its gradient is exactly zero for all z < 0, which is the “dead neuron” problem (an extreme form of the “vanishing gradient” problem), so it may not work well in all cases. Still, it seems to work remarkably well in lots of cases.

So there is a natural order in which to try the possible hidden layer activation functions: start with ReLU. If that doesn’t work well, try Leaky ReLU, which is almost as cheap to compute and eliminates the “dead neuron” problem. With Leaky ReLU you can also try different values of the slope for negative inputs. If that doesn’t work either, try the more expensive functions like tanh, sigmoid, swish or other possibilities.
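To make the “dead neuron” point concrete, here is a minimal sketch that compares the gradients of these activations at a few values of z. It is only an illustration: the function names, the default slope of 0.01 and the sample values of z are my own choices, not anything from the course notebooks, and it works on plain scalars for readability, whereas the assignments use vectorized NumPy code.

```python
import math

def relu(z):
    # ReLU: passes positive inputs through, clamps the rest to 0
    return max(0.0, z)

def relu_grad(z):
    # Gradient is 1 for z > 0 and exactly 0 otherwise
    # (using 0 at z == 0 is a convention, assumed here)
    return 1.0 if z > 0 else 0.0

def leaky_relu(z, slope=0.01):
    # Leaky ReLU: small non-zero slope for negative inputs
    # (slope=0.01 is an assumed default; it is itself a hyperparameter)
    return z if z > 0 else slope * z

def leaky_relu_grad(z, slope=0.01):
    # Gradient never drops to zero, so the unit can keep learning
    return 1.0 if z > 0 else slope

def tanh_grad(z):
    # d/dz tanh(z) = 1 - tanh(z)^2
    return 1.0 - math.tanh(z) ** 2

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    # d/dz sigmoid(z) = sigmoid(z) * (1 - sigmoid(z))
    s = sigmoid(z)
    return s * (1.0 - s)

# Hypothetical sample inputs, chosen only to show both signs of z
for z in [-2.0, -0.5, 0.5, 2.0]:
    print(
        f"z={z:+.1f}  relu'={relu_grad(z):.3f}  "
        f"leaky'={leaky_relu_grad(z):.3f}  "
        f"tanh'={tanh_grad(z):.3f}  sigmoid'={sigmoid_grad(z):.3f}"
    )
```

For the negative inputs, the ReLU gradient is zero while the Leaky ReLU gradient equals the slope, which is why Leaky ReLU is the natural next thing to try. The tanh and sigmoid gradients are non-zero everywhere but need an exponential to compute, which is the extra cost mentioned above.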
