When enumerating the list of hyperparameters in Week 3, no reference was made to the selection of different activation functions. Is that because there are only a handful of options, because ReLU is generally the best alternative, or is there some other reason? In any case, how important is the activation function compared to tuning the learning rate, for example? Thank you.
Thanks for pointing this out. That is an omission! I did a quick scan of the transcripts of those lectures in Week 3. I think the reason is that Prof Ng is concentrating on the hyperparameters that have numerical ranges and on strategies for handling those choices; activation function selection doesn’t really fit that framework.
The choice of activation function for the hidden layers is an important hyperparameter, and there are lots of choices. A common practice is to start with ReLU, since it is by far the cheapest to compute. You can view it as the “minimalist” activation function: just the minimal amount of non-linearity, and dirt cheap to compute. If it works, that’s great. But ReLU definitely does not always work: its gradient is exactly zero for Z < 0, which can lead to “dead neurons” that stop learning. The next thing to try is Leaky ReLU: it’s almost as cheap to compute, but its small non-zero slope for Z < 0 avoids the dead neuron problem. If that doesn’t work, then you graduate to more expensive and sophisticated functions like tanh, Swish or sigmoid.
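Just to make that menu concrete, here is a minimal NumPy sketch of the functions mentioned above. The leaky slope of 0.01 is just a common default, not something prescribed by the course:

```python
import numpy as np

def relu(z):
    # Zero for z < 0, identity for z >= 0; gradient is exactly 0 on the negative side
    return np.maximum(0, z)

def leaky_relu(z, alpha=0.01):
    # Small positive slope (alpha) for z < 0 keeps the gradient non-zero there
    return np.where(z > 0, z, alpha * z)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def tanh(z):
    return np.tanh(z)

def swish(z):
    # Swish: z * sigmoid(z) -- smooth but costs an exponential per element
    return z * sigmoid(z)
```

You can see from the definitions why ReLU and Leaky ReLU are so cheap: they are just comparisons and multiplications, whereas sigmoid, tanh and Swish all require exponentials.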
In terms of the relative importance of the choice of activation function versus the learning rate, I don’t know a definitive answer, but I would say the learning rate is probably not worth worrying about too much. The reason is that pretty soon we will graduate to using TensorFlow for everything, and it provides more sophisticated optimization algorithms (like Adam) that adapt the effective step size during training, so the default learning rate usually works without manual tuning. In other words, pretty soon the learning rate will cease to be a knob you need to turn very often.
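For example, here is a minimal sketch using the Keras API in TensorFlow; the model and layer sizes are made up purely for illustration, but it shows that you can hand the learning rate question over to Adam’s defaults:

```python
import tensorflow as tf

# Hypothetical model just for illustration; the layer sizes are arbitrary.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

# Adam adapts per-parameter step sizes during training; its default
# learning rate (0.001) is often a reasonable starting point.
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```

Note that you can still pass an explicit learning rate to the optimizer if you need to, but in practice it becomes a much less sensitive knob than it is with plain gradient descent.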