What is the rationale of using ReLU and softmax and not other activation functions?
Those are two different cases, right? You only use softmax at the output layer when you have a multiclass classification problem. For classifiers, there is essentially no choice for the output layer activation: you use sigmoid for binary classifiers and softmax for multiclass classifiers. You’ll notice that the cross entropy loss function works for both. You can legitimately consider softmax to be the multiclass generalization of sigmoid.
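To see concretely why softmax is the multiclass generalization of sigmoid, here is a small NumPy sketch (the function names are just illustrative): a two-class softmax over logits [z, 0] reproduces sigmoid(z) exactly.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    # Subtract the max logit for numerical stability; this
    # does not change the result.
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = 2.0
# Softmax over the two logits [z, 0] gives the same probability
# for class 1 as sigmoid(z) does.
two_class = softmax(np.array([z, 0.0]))
print(two_class[0])   # same value as sigmoid(z)
print(sigmoid(z))
```

Algebraically, softmax([z, 0]) = e^z / (e^z + 1) = 1 / (1 + e^-z) = sigmoid(z), which is why binary cross entropy and categorical cross entropy agree in the two-class case.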
For the hidden layers of a NN, it is a different story: there you can use any of a wide range of activation functions: ReLU, Leaky ReLU, sigmoid, tanh, swish and others. The reason ReLU is used so frequently is that you can think of it as the "minimalist" hidden layer activation: it is dirt cheap to compute, and it provides just one point of non-linearity, which turns out to be enough. From a mathematical standpoint, there is no such thing as "almost linear", right? A function is either linear or it's not. Of course, ReLU also has a serious limitation: the "dead neuron" (or "dying ReLU") problem. Its gradient is exactly zero for all z < 0, so a neuron whose pre-activation gets stuck in that region stops learning entirely.

So the usual approach is to start by trying ReLU. It either works or it doesn't. If it doesn't, next you try Leaky ReLU, which is almost as cheap to compute as ReLU and fixes the "dead neuron" problem by keeping a small nonzero slope for z < 0. If that also doesn't work, only then do you consider the more computationally expensive functions like tanh, sigmoid and swish. The choice of hidden layer activation is yet another "hyperparameter", meaning a decision you as the system designer need to make.
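A quick sketch of the ReLU vs. Leaky ReLU trade-off described above (the slope 0.01 for Leaky ReLU is a common default, not something fixed by the answer): the ReLU gradient is exactly zero for z < 0, while Leaky ReLU keeps a small nonzero gradient there.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def relu_grad(z):
    # Exactly zero for z < 0: the "dead neuron" region.
    return (z > 0).astype(float)

def leaky_relu(z, alpha=0.01):
    # alpha is the small slope on the negative side.
    return np.where(z > 0, z, alpha * z)

def leaky_relu_grad(z, alpha=0.01):
    # Never zero, so gradient flow is never fully cut off.
    return np.where(z > 0, 1.0, alpha)

z = np.array([-2.0, -0.5, 0.5, 2.0])
print(relu(z), relu_grad(z))          # zeros for negative inputs
print(leaky_relu(z), leaky_relu_grad(z))
```

Both are single elementwise comparisons, which is what makes them so cheap compared to tanh, sigmoid, or swish, all of which require an exponential per unit.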