Confusion on sigmoid disadvantages in "Choosing Activation Functions" video in C2W2

Hello,
I was watching the video for this section and am confused by the explanation of the disadvantages of the sigmoid relative to the ReLU, specifically the “flat” parts of the sigmoid graph. It mentions that gradient descent becomes slow there, and I was hoping to see an example:

I think I am misunderstanding the reason why the ReLU is better in this case. Is it that:

  1. on a flat portion of the sigmoid or ReLU graphs above, the gradients are small, so gradient descent takes smaller steps?
  2. the sigmoid has 2 near-flat regions, whereas the ReLU has 1?

I tried looking back at the C1W3 video “Gradient Descent Implementation” to try to understand (screenshot below).


The single advantage of ReLU is that it is extremely easy to compute the gradients. They are either 0 (for negative inputs) or 1 (for positive inputs).

The ReLU drawback is that you get no useful gradients for any negative values, and you only get a fixed-value gradient for positive values.

This makes a single ReLU unit fairly inefficient, so you need a lot more ReLU units than if you used a single sigmoid() unit.

In comparison, sigmoid() has very useful gradients for all input values within a range of about -5 to +5 (avoiding the flat regions where the gradients approach zero). The drawback to sigmoid() is that it is computationally expensive.
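
To put rough numbers on that, here is a minimal numpy sketch (my own illustration, not course code; sigmoid_grad and relu_grad are just helper names I made up). The ReLU gradient is always exactly 0 or 1, while the sigmoid gradient peaks at 0.25 near z = 0 and is effectively zero outside roughly -5 to +5.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    # d/dz sigmoid(z) = sigmoid(z) * (1 - sigmoid(z)); its maximum is 0.25 at z = 0
    s = sigmoid(z)
    return s * (1.0 - s)

def relu_grad(z):
    # d/dz max(0, z): 0 for negative inputs, 1 for positive inputs
    return (z > 0).astype(float)

z = np.array([-10.0, -5.0, -1.0, 0.0, 1.0, 5.0, 10.0])
print("z           :", z)
print("sigmoid'(z) :", np.round(sigmoid_grad(z), 5))  # ~0 once |z| is past about 5
print("ReLU'(z)    :", relu_grad(z))                  # always exactly 0 or 1
```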


Hello @TMosh, thanks for the quick reply. I’m having some trouble understanding this with respect to the gradients and why it is faster for ReLU. Can you go into more detail with an example using the gradients and partial derivatives? Is the following interpretation correct?

  1. the gradient descent updates are w_j = w_j - a * dJ/dw_j and b = b - a * dJ/db
  2. the partial derivatives include the model output f_w,b(x) (second image above)
  3. if you are on a flat portion of the sigmoid, that f_w,b(x) value barely changes, so the partial derivative (and thus each update step) is very small and the weights update slowly?


Yes, these statements are all true.
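
To make point 3 concrete, here is a tiny sketch (a toy single-unit setup of my own, not course code, and it uses a squared-error cost just to keep the derivative short rather than the course's logistic cost). When z = w*x + b sits on a flat part of the sigmoid, the g'(z) factor in the partial derivative is nearly zero, so the update w = w - a * dJ/dw barely moves w even though the prediction is badly wrong.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One training example, one sigmoid unit, squared-error cost (chosen only to
# keep the algebra short):  f = g(w*x + b),  J = 0.5 * (f - y)**2,
# so dJ/dw = (f - y) * g'(w*x + b) * x, where g'(z) = g(z) * (1 - g(z)).
x, y, b, alpha = 1.0, 1.0, 0.0, 0.1

for w in [-8.0, 0.0]:                # w = -8 puts z on the flat left tail of the sigmoid
    z = w * x + b
    f = sigmoid(z)
    g_prime = f * (1.0 - f)          # nearly zero on the flat regions
    dJ_dw = (f - y) * g_prime * x
    w_new = w - alpha * dJ_dw        # gradient-descent update: w = w - a * dJ/dw
    print(f"w = {w:+.1f}  f = {f:.4f}  g'(z) = {g_prime:.6f}  new w = {w_new:+.6f}")
```

With w = -8 the prediction is badly wrong (f is about 0.0003 while y = 1), yet the step is only about 3e-5 because g'(z) is nearly zero; at w = 0 the same learning rate moves w by 0.0125, a few hundred times further.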


For anyone else who is confused by this, I found more info by looking up the ‘vanishing gradients’ issue. While back propagation hasn’t been covered at this point in the course (at that video), if you google some resources you can see that these small ‘vanishing’ gradients are a problem for the sigmoid that doesn’t arise for the ReLU. I recommend checking out some YouTube videos on that.
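
As a rough numeric sketch of my own (not from the course): back propagation multiplies one activation derivative per layer into the gradient, and sigmoid'(z) is at most 0.25, so that factor shrinks geometrically with depth, while an active ReLU contributes a factor of exactly 1.

```python
import numpy as np

def sigmoid_grad(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)          # peaks at 0.25 when z = 0

# Best case for the sigmoid (every layer sitting at z = 0) vs. an active ReLU:
for layers in [1, 5, 10]:
    sig_factor = sigmoid_grad(0.0) ** layers   # 0.25 ** layers
    relu_factor = 1.0 ** layers                # 1 for positive inputs
    print(f"{layers:2d} layers: sigmoid factor = {sig_factor:.2e}, ReLU factor = {relu_factor:.0f}")
```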


Hi @YodaKenobi, if you are still on this, as one more extended read, please also google for leaky ReLU vs. ReLU for a focus on the flat half of the latter.
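
In case it helps, here is roughly what the two look like side by side (a minimal sketch; the 0.01 slope on the negative half is just a common default, not a fixed part of the definition):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def leaky_relu(z, slope=0.01):
    # same as ReLU for positive z, but keeps a small non-zero slope for
    # negative z so the gradient never goes completely flat there
    return np.where(z > 0, z, slope * z)

z = np.array([-3.0, -1.0, 0.0, 2.0])
print("ReLU      :", relu(z))         # [0. 0. 0. 2.]
print("leaky ReLU:", leaky_relu(z))   # [-0.03 -0.01  0.    2.  ]
```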

Cheers.
