Hello,
I was watching the video for this section and am confused by the explanation of the disadvantages of the sigmoid relative to the ReLU, specifically the “flat” parts of the sigmoid graph. It mentions that gradient descent is slow there, and I was hoping to see an example:
The single advantage of ReLU is that its gradients are extremely easy to compute. They are either 0 (for negative inputs) or 1 (for positive inputs).
The ReLU drawback is that you get no useful gradients for any negative values, and you only get a fixed-value gradient for positive values.
This makes each individual ReLU unit less expressive, so you need a lot more ReLU units than you would if you used a single sigmoid() unit.
In comparison, sigmoid() has useful gradients for all input values within a range of about -5 to +5 (avoiding the flat regions where the gradient approaches zero). The drawback to sigmoid() is that it is computationally expensive.
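If it helps to see numbers, here is a quick NumPy sketch (my own, not from the course notebooks) that prints the derivative of each activation at a few inputs. The sigmoid derivative collapses toward zero outside roughly -5 to +5, while the ReLU derivative is a constant 1 for every positive input:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)            # never exceeds 0.25, tiny when |z| is large

def relu_grad(z):
    return 1.0 if z > 0 else 0.0    # constant 1 for positive inputs, 0 otherwise

for z in [-10.0, -5.0, 0.0, 5.0, 10.0]:
    print(f"z = {z:6.1f}   sigmoid'(z) = {sigmoid_grad(z):.6f}   relu'(z) = {relu_grad(z):.1f}")
```

At z = 10 the sigmoid derivative is already around 4.5e-5, which is the “flat region” the video refers to.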
Hello @TMosh , thanks for the quick reply: I’m having some trouble understanding this with respect to the gradients and why it is faster for ReLU. Can you go into more detail with an example of the gradients and partial derivatives? Is the following interpretation correct?
The gradient descent updates are w_j = w_j - α * ∂J/∂w_j and b = b - α * ∂J/∂b
The partial derivatives include the model output f_{w,b}(x) (second image above)
If you are on a flat portion of the sigmoid, is it that f_{w,b}(x) barely changes there, so the partial derivatives (and thus the gradient descent updates) are very small and the weights update slowly?
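Here is a small sketch I put together to check that interpretation. It uses a single sigmoid unit with a squared-error cost for simplicity (not the exact cost from the course), so the chain rule puts a f * (1 - f) factor in the gradient; when z = w*x + b sits on a flat part of the sigmoid, that factor is nearly zero and the update to w barely moves even though the prediction is badly wrong:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One sigmoid unit with cost J = 0.5 * (f - y)^2, chosen only to keep the
# chain rule short; the sigmoid-derivative factor f*(1-f) is the point here.
def gradient_step(w, b, x, y, alpha):
    z = w * x + b
    f = sigmoid(z)
    dJ_df = f - y
    df_dz = f * (1.0 - f)           # sigmoid derivative -- near 0 on the flat parts
    dJ_dw = dJ_df * df_dz * x       # chain rule
    dJ_db = dJ_df * df_dz
    return w - alpha * dJ_dw, b - alpha * dJ_db, dJ_dw

x, y, alpha = 1.0, 1.0, 0.1
for w0 in [0.5, -8.0]:              # w0 = -8 puts z deep in the flat region
    w, b, grad = gradient_step(w0, 0.0, x, y, alpha)
    print(f"start w = {w0:5.1f}   dJ/dw = {grad:.6f}   new w = {w:.6f}")
```

Starting from w = -8, the gradient is about 0.0003, so the weight moves almost nowhere per step.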
For anyone else who is confused by this, I found more info by looking up the ‘vanishing gradients’ issue. While back propagation hasn’t been covered at this point in the course (at that video), if you google some resources you can see that these small ‘vanishing gradients’ are an issue with the sigmoid that doesn’t affect the ReLU. I recommend checking out some YouTube videos on that.
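To put a rough number on it (a toy illustration of mine, not course material): back propagation multiplies the local derivatives of the layers together, and the sigmoid derivative never exceeds 0.25, so even in the best case the factor shrinks geometrically with depth, while ReLU contributes a factor of exactly 1 for positive inputs:

```python
import math

def sigmoid_grad(z):
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)

# Best case for the sigmoid: z = 0 gives the maximum derivative of 0.25.
# Chaining that through many layers still shrinks the gradient signal fast.
grad = 1.0
for layer in range(1, 11):
    grad *= sigmoid_grad(0.0)
    print(f"after {layer:2d} sigmoid layers, gradient factor <= {grad:.2e}")
```

After 10 layers the factor is already below 1e-6, which is the “vanishing” part.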
Hi @YodaKenobi, if you are still on this, as one more extended read, please also google for leaky ReLU vs. ReLU for a focus on the flat half of the latter.
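In case a concrete picture helps, a minimal sketch of the idea (the slope alpha = 0.01 is just a common illustrative choice, not something from the course): leaky ReLU replaces the flat negative half of ReLU with a small slope, so units with negative inputs still receive a nonzero gradient:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def leaky_relu(z, alpha=0.01):
    # Keep a small slope (alpha) for negative inputs instead of a flat zero,
    # so the gradient there is alpha rather than exactly 0.
    return np.where(z > 0, z, alpha * z)

z = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])
print("ReLU:      ", relu(z))
print("Leaky ReLU:", leaky_relu(z))
```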