What is the rationale of using ReLU and softmax and not other activation functions?
Those are two different cases, right? You only use softmax at the output layer when you have a multiclass classification problem. For classifiers, there is essentially no choice for the output layer activation: you use sigmoid for binary classifiers and softmax for multiclass classifiers. You’ll notice that the cross entropy loss function works for both. You can legitimately consider softmax to be the multiclass generalization of sigmoid.
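To see concretely why softmax is the multiclass generalization of sigmoid, here is a small NumPy sketch (the function names are just illustrative): a two-class softmax over logits [z, 0] reproduces sigmoid(z) exactly.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    # Subtract the max logit for numerical stability; this
    # does not change the result.
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = 2.0
# Softmax over the two logits [z, 0] gives the same probability
# for class 1 as sigmoid(z) does.
two_class = softmax(np.array([z, 0.0]))
print(two_class[0])   # same value as sigmoid(z)
print(sigmoid(z))
```

Algebraically, softmax([z, 0]) = e^z / (e^z + 1) = 1 / (1 + e^-z) = sigmoid(z), which is why binary cross entropy and categorical cross entropy agree in the two-class case.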
For the hidden layers of a NN, it is a different story: there you can use any of a wide range of activation functions: ReLU, Leaky ReLU, sigmoid, tanh, swish and others. The reason ReLU is used so frequently is that you can think of it as the "minimalist" hidden layer activation: it is dirt cheap to compute, and it provides just one point of non-linearity, which turns out to be enough. From a mathematical standpoint, there is no such thing as "almost linear", right? A function is either linear or it's not. Of course, ReLU also has a serious limitation: the "dead neuron" (or "dying ReLU") problem. Its gradient is exactly zero for all z < 0, so a neuron whose pre-activation gets stuck in that region stops learning entirely.

So the usual approach is to start by trying ReLU. It either works or it doesn't. If it doesn't, next you try Leaky ReLU, which is almost as cheap to compute as ReLU and fixes the "dead neuron" problem by keeping a small nonzero slope for z < 0. If that also doesn't work, only then do you consider the more computationally expensive functions like tanh, sigmoid and swish. The choice of hidden layer activation is yet another "hyperparameter", meaning a decision you as the system designer need to make.
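A quick sketch of the ReLU vs. Leaky ReLU trade-off described above (the slope 0.01 for Leaky ReLU is a common default, not something fixed by the answer): the ReLU gradient is exactly zero for z < 0, while Leaky ReLU keeps a small nonzero gradient there.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def relu_grad(z):
    # Exactly zero for z < 0: the "dead neuron" region.
    return (z > 0).astype(float)

def leaky_relu(z, alpha=0.01):
    # alpha is the small slope on the negative side.
    return np.where(z > 0, z, alpha * z)

def leaky_relu_grad(z, alpha=0.01):
    # Never zero, so gradient flow is never fully cut off.
    return np.where(z > 0, 1.0, alpha)

z = np.array([-2.0, -0.5, 0.5, 2.0])
print(relu(z), relu_grad(z))          # zeros for negative inputs
print(leaky_relu(z), leaky_relu_grad(z))
```

Both are single elementwise comparisons, which is what makes them so cheap compared to tanh, sigmoid, or swish, all of which require an exponential per unit.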