What is the rationale of using ReLU and softmax and not other activation functions?

Those are two different cases, right? You only use *softmax* at the output layer when you have a multiclass classification problem. For classifiers, there is essentially no choice for the *output layer* activation: you use *sigmoid* for binary classifiers and *softmax* for multiclass classifiers. You’ll notice that the cross-entropy loss function works for both. You can legitimately consider *softmax* to be the multiclass generalization of *sigmoid*.
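To make that last point concrete, here is a minimal NumPy sketch (the function names are mine, not from any particular library): with two classes, the softmax probability of class 1 is exactly the sigmoid of the logit difference.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    # subtract the max for numerical stability
    e = np.exp(z - np.max(z))
    return e / e.sum()

# Two-class logits: softmax of class 1 equals sigmoid of the
# difference of the logits, i.e. softmax generalizes sigmoid.
logits = np.array([0.3, 1.7])
print(softmax(logits)[1])               # ~0.8022
print(sigmoid(logits[1] - logits[0]))   # same value
```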

For the hidden layers of a NN, it is a different case: here you can use any of a wide range of activation functions: ReLU, Leaky ReLU, *sigmoid*, *tanh*, *swish* and others. The reason ReLU is used so frequently is that you can think of it as the “minimalist” hidden layer activation: it is dirt cheap to compute and it provides just one point of non-linearity, which turns out to be enough. From a mathematical standpoint, there is no such thing as “almost linear”, right? It’s either linear or it’s not. Of course, it also has some limitations: its gradient is exactly zero for all z < 0, so a unit stuck in that region stops learning entirely, the “dead neuron” (or “dying ReLU”) problem in its worst form. So the normal approach is to always start by trying ReLU. It either works or it doesn’t. If it doesn’t, then next you try Leaky ReLU, which is almost as cheap to compute as ReLU and fixes the “dead neuron” problem by keeping a small nonzero slope for z < 0. If that also doesn’t work, only then do you consider the more computationally expensive functions like *tanh*, *sigmoid* and *swish*. The choice of hidden layer activation is yet another “hyperparameter”, meaning a decision you as the system designer need to make.
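Here is a quick sketch of what that means in practice (again, my own helper names, assuming the usual definitions of the two functions): ReLU’s gradient is exactly 0 for z < 0, while Leaky ReLU keeps a small slope there.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def relu_grad(z):
    # gradient is exactly 0 for z < 0 -- this is the "dead neuron" issue
    return (z > 0).astype(float)

def leaky_relu(z, alpha=0.01):
    return np.where(z > 0, z, alpha * z)

def leaky_relu_grad(z, alpha=0.01):
    # small but nonzero slope for z < 0, so the unit can still update
    return np.where(z > 0, 1.0, alpha)

z = np.array([-2.0, -0.5, 0.5, 2.0])
print(relu(z), relu_grad(z))              # zeros and zero gradients for z < 0
print(leaky_relu(z), leaky_relu_grad(z))  # small negative outputs, slope alpha
```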