Hello,
I was watching the video for this section and am confused by the explanation of the disadvantages of the sigmoid relative to the ReLU, specifically the “flat” parts of the sigmoid graph. It mentions that gradient descent is slow there, and I was hoping to see an example:
The single advantage of ReLU is that its gradients are extremely easy to compute. They are either 0 (for negative inputs) or 1 (for positive inputs).
The ReLU drawback is that you get no useful gradients for any negative values, and you only get a fixed-value gradient for positive values.
This makes each individual ReLU unit less expressive, so you need a lot more ReLU units than you would if you used a single sigmoid() unit.
In comparison, sigmoid() has useful gradients for all input values within a range of about -5 to +5 (avoiding the flat regions where the gradient approaches zero). The drawback to sigmoid() is that it is computationally expensive.
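If it helps to see numbers, here is a quick NumPy sketch (my own, not from the course notebooks) that prints the derivative of each activation at a few inputs. The sigmoid derivative collapses toward zero outside roughly -5 to +5, while the ReLU derivative is a constant 1 for every positive input:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)            # never exceeds 0.25, tiny when |z| is large

def relu_grad(z):
    return 1.0 if z > 0 else 0.0    # constant 1 for positive inputs, 0 otherwise

for z in [-10.0, -5.0, 0.0, 5.0, 10.0]:
    print(f"z = {z:6.1f}   sigmoid'(z) = {sigmoid_grad(z):.6f}   relu'(z) = {relu_grad(z):.1f}")
```

At z = 10 the sigmoid derivative is already around 4.5e-5, which is the “flat region” the video refers to.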
Hello @TMosh , thanks for the quick reply: I’m having some trouble understanding this with respect to the gradients and why it is faster for ReLU. Can you go into more detail with an example of the gradients and partial derivatives? Is the following interpretation correct?
The gradient descent updates are w_j = w_j - α * ∂J/∂w_j and b = b - α * ∂J/∂b
The partial derivatives include the model output f_{w,b}(x) (second image above)
If you are on a flat portion of the sigmoid, is it that f_{w,b}(x) barely changes there, so the partial derivatives (and thus the gradient descent updates) are very small and the weights update slowly?
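Here is a small sketch I put together to check that interpretation. It uses a single sigmoid unit with a squared-error cost for simplicity (not the exact cost from the course), so the chain rule puts a f * (1 - f) factor in the gradient; when z = w*x + b sits on a flat part of the sigmoid, that factor is nearly zero and the update to w barely moves even though the prediction is badly wrong:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One sigmoid unit with cost J = 0.5 * (f - y)^2, chosen only to keep the
# chain rule short; the sigmoid-derivative factor f*(1-f) is the point here.
def gradient_step(w, b, x, y, alpha):
    z = w * x + b
    f = sigmoid(z)
    dJ_df = f - y
    df_dz = f * (1.0 - f)           # sigmoid derivative -- near 0 on the flat parts
    dJ_dw = dJ_df * df_dz * x       # chain rule
    dJ_db = dJ_df * df_dz
    return w - alpha * dJ_dw, b - alpha * dJ_db, dJ_dw

x, y, alpha = 1.0, 1.0, 0.1
for w0 in [0.5, -8.0]:              # w0 = -8 puts z deep in the flat region
    w, b, grad = gradient_step(w0, 0.0, x, y, alpha)
    print(f"start w = {w0:5.1f}   dJ/dw = {grad:.6f}   new w = {w:.6f}")
```

Starting from w = -8, the gradient is about 0.0003, so the weight moves almost nowhere per step.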
For anyone else who is confused by this, I found more info by looking up the ‘vanishing gradients’ issue. While back propagation hasn’t been covered at this point in the course (at that video), if you google some resources you can see that these small ‘vanishing gradients’ are an issue with the sigmoid that doesn’t affect the ReLU. I recommend checking out some YouTube videos on that.
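To put a rough number on it (a toy illustration of mine, not course material): back propagation multiplies the local derivatives of the layers together, and the sigmoid derivative never exceeds 0.25, so even in the best case the factor shrinks geometrically with depth, while ReLU contributes a factor of exactly 1 for positive inputs:

```python
import math

def sigmoid_grad(z):
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)

# Best case for the sigmoid: z = 0 gives the maximum derivative of 0.25.
# Chaining that through many layers still shrinks the gradient signal fast.
grad = 1.0
for layer in range(1, 11):
    grad *= sigmoid_grad(0.0)
    print(f"after {layer:2d} sigmoid layers, gradient factor <= {grad:.2e}")
```

After 10 layers the factor is already below 1e-6, which is the “vanishing” part.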
Hi @YodaKenobi, if you are still on this, as one more extended read, please also google for leaky ReLU vs. ReLU for a focus on the flat half of the latter.
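In case a concrete picture helps, a minimal sketch of the idea (the slope alpha = 0.01 is just a common illustrative choice, not something from the course): leaky ReLU replaces the flat negative half of ReLU with a small slope, so units with negative inputs still receive a nonzero gradient:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def leaky_relu(z, alpha=0.01):
    # Keep a small slope (alpha) for negative inputs instead of a flat zero,
    # so the gradient there is alpha rather than exactly 0.
    return np.where(z > 0, z, alpha * z)

z = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])
print("ReLU:      ", relu(z))
print("Leaky ReLU:", leaky_relu(z))
```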