Week3 - Choice of Activation function

Yes, the choice of hidden layer activations is one of the “hyperparameters”, meaning choices that you need to make. As you say, the output layer is fixed: sigmoid for binary classifications and softmax for multiclass classifications (we haven’t learned about softmax yet, but we will in Course 2). But for the hidden layers, you have quite a few choices. What you will see in this and the subsequent courses is that Prof Ng normally uses ReLU for the hidden layer activations, although he uses tanh here in Week 3. You can think of ReLU as the “minimalist” activation function: it’s dirt cheap to compute and provides just the minimum required amount of non-linearity. But it has some limitations as well: it has the “dead neuron” or “vanishing gradient” problem for all z < 0, so it may not work well in all cases. But it seems to work remarkably well in lots of cases. So it looks like there is a natural order in which you try the possible hidden layer activation functions: start with ReLU, if that doesn’t work well then try Leaky ReLU, which is almost as cheap to compute and eliminates the “dead neuron” problem. With Leaky ReLU you also can try different values of the slope for negative values. If that doesn’t work, then you try the more expensive functions like tanh, sigmoid, swish or other possibilities.