Alternatives to sigmoid function

Indeed it is a great question!

Since I believe you agree that the main issue here is the output range (rather than any particular name we give it), I think we can rephrase the question as: why ReLU over sigmoid (or the other way around) for hidden layers?
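To make the output-range point concrete, here is a quick NumPy sketch (my own illustration, not from the course) of the two activations side by side: sigmoid squashes everything into (0, 1), while ReLU passes positive values through unchanged.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(0.0, z)

z = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
print(sigmoid(z))  # squashed into (0, 1): roughly [0.00005, 0.27, 0.5, 0.73, 0.99995]
print(relu(z))     # unbounded above:      [0., 0., 0., 1., 10.]
```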

This topic isn’t discussed in depth within the scope of the currently released Courses 1 & 2. I am not sure about Course 3 because it has not been released yet, but I did a quick scan of the Deep Learning Specialization (DLS) and found that its Course 1 Week 3 video “Activation Function” compares sigmoid and ReLU.

I highly recommend you watch the video yourself, but in short, compared to ReLU: (1) sigmoid is computationally slower because it involves evaluating the exponential function, and (2) sigmoid gives a very small gradient (close to 0) in the far positive and far negative regions where it saturates, so some parameters receive only tiny update steps, which slows down training.
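Here is a small sketch (again my own, not from the video) of point (2): the sigmoid gradient collapses towards 0 as the input grows, while ReLU’s gradient stays at 1 for any positive input.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)              # at most 0.25, reached at z = 0

def relu_grad(z):
    return np.where(z > 0, 1.0, 0.0)  # exactly 1 for any positive input

z = np.array([0.0, 2.0, 5.0, 10.0])
print(sigmoid_grad(z))  # roughly [0.25, 0.105, 0.0066, 0.000045] -- shrinks fast
print(relu_grad(z))     # [0., 1., 1., 1.]
```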

A topic relevant to the second point above is the “vanishing gradient problem”, which is introduced in the DLS Course 2 Week 1 video “Vanishing / Exploding gradient”; again, please watch it if you want to know more. Both the small gradient values and the fact that sigmoid squashes a wide range of inputs into the narrow output range (0, 1) contribute to the problem, and it becomes more significant as you make your neural network deeper.
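As a rough back-of-the-envelope illustration (ignoring the weight matrices, which also enter the chain-rule product), here is how the per-layer activation-derivative factors compound with depth:

```python
# Backprop multiplies one local activation derivative per layer, so with
# sigmoid each factor is at most 0.25 and the product shrinks exponentially
# with depth, while ReLU contributes a factor of 1 in its active region.
sigmoid_factor = 0.25   # best case for sigmoid'
relu_factor = 1.0       # relu' for positive inputs

for depth in [5, 10, 20, 50]:
    print(f"depth={depth:3d}  "
          f"sigmoid chain <= {sigmoid_factor ** depth:.2e}  "
          f"relu chain = {relu_factor ** depth:.1f}")
# sigmoid chain is <= 9.77e-04 at depth 5 and ~7.89e-31 at depth 50,
# while the relu chain stays at 1.0
```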

If a DLS mentor happens to see this post, I hope they will share more insights or past discussions with us.
