In Week 3, the professor discusses different activation functions such as tanh, ReLU, and Leaky ReLU for the hidden layers. I understand the use of sigmoid for the output layer for binary classification. Otherwise, how do you determine which function to use? Do you try out different options and see what fits best?

Yes, the choice of hidden layer activations is one of the “hyperparameters”, meaning choices that you need to make. As you say, the output layer is fixed: sigmoid for binary classification and softmax for multiclass classification (we haven’t learned about softmax yet, but we will in Course 2). But for the hidden layers, you have quite a few choices. What you will see in this and the subsequent courses is that Prof Ng normally uses ReLU for the hidden layer activations, although he uses tanh here in Week 3.

You can think of ReLU as the “minimalist” activation function: it’s dirt cheap to compute and provides just the minimum required amount of non-linearity. But it has some limitations as well: it has the “dead neuron” or “vanishing gradient” problem for all z < 0, so it may not work well in all cases, although it does work remarkably well in lots of cases. So there is a natural order in which to try the possible hidden layer activation functions: start with ReLU; if that doesn’t work well, then try Leaky ReLU, which is almost as cheap to compute and eliminates the “dead neuron” problem. With Leaky ReLU you can also try different values of the slope for negative inputs. If that doesn’t work, then you try the more expensive functions like tanh, sigmoid, swish, or other possibilities.
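To make the comparison concrete, here is a minimal sketch of these activation functions in NumPy. The 0.01 negative slope for Leaky ReLU and the beta parameter for swish are common defaults I chose for illustration, not values prescribed by the course; both are themselves tunable hyperparameters.

```python
import numpy as np

def relu(z):
    # max(0, z): very cheap, but the gradient is 0 for all z < 0,
    # which is the source of the "dead neuron" problem
    return np.maximum(0, z)

def leaky_relu(z, slope=0.01):
    # a small positive slope for z < 0 keeps gradients from dying;
    # slope=0.01 is a common default, not a required value
    return np.where(z > 0, z, slope * z)

def sigmoid(z):
    # smooth squashing to (0, 1); more expensive (exp) than ReLU
    return 1 / (1 + np.exp(-z))

def tanh(z):
    # smooth squashing to (-1, 1); what Prof Ng uses in Week 3
    return np.tanh(z)

def swish(z, beta=1.0):
    # z * sigmoid(beta * z): smooth and non-monotonic
    return z * sigmoid(beta * z)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(z))        # negatives clipped to 0
print(leaky_relu(z))  # negatives scaled by 0.01 instead of zeroed
```

Note how `leaky_relu` differs from `relu` only below zero, which is why it is almost as cheap while avoiding dead neurons.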