Vanishing / exploding gradients

Hello
When Prof. Andrew explained the concept of vanishing/exploding gradients, he assumed that we're using a linear activation function g(z) = z, and thus we end up with very small/large activation values, because we have a^(l) = W^(l) W^(l-1) … W^(2) W^(1) X, and therefore the weights become too small or too large because they have to adjust to very small or large activation values. First, I want to know whether what I said is true or not. Second, I think (assuming what I said is true) that if we don't use a linear activation function, we won't have this problem, because we don't end up multiplying the weights like this: a^(l) = W^(l) W^(l-1) … W^(2) W^(1) X.

Hey @abdou_brk,
You're right that Prof. Andrew uses a linear activation function there. The rest of your argument seems a bit shaky to me, but never mind, let's try to see how we can state it better. Here, I am assuming we are referring to the lecture video entitled “Vanishing / Exploding Gradients”.

First of all, our area of interest here is “very small/large gradients” and not “very small/large activation values”. Second, in the video, Prof. Andrew doesn’t want the learners to focus on how the weights are calculated; instead, he wants them to focus on what happens when the weights are larger/smaller than 1 in this particular example, and to carry those insights over to other examples.
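To make the “larger/smaller than 1” intuition concrete, here is a tiny NumPy sketch (my own illustration, not course code) where every layer uses the same scaled-identity weight matrix and a linear activation, roughly in the spirit of the simplified setup in the video:

```python
import numpy as np

# Toy deep network: every layer uses W^(l) = scale * I and g(z) = z,
# so a^(l) = W^(l) W^(l-1) ... W^(1) x = scale^L * x.
def forward_linear(x, scale, num_layers):
    W = scale * np.eye(len(x))
    a = x
    for _ in range(num_layers):
        a = W @ a
    return a

x = np.array([1.0, 1.0])
print(forward_linear(x, 1.5, 50))   # ~6.4e8 per entry  -> explodes
print(forward_linear(x, 0.5, 50))   # ~8.9e-16 per entry -> vanishes
```

The same “multiply by something bigger/smaller than 1, fifty times” effect shows up in the gradients, which is the real concern.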

The part about the activations being a product of the weight matrices, a^(l) = W^(l) W^(l-1) … W^(2) W^(1) X, is correct, and if we assume the loss function to be, say, MSE, you will find that the gradients also contain a product of the weights. Hence, weights larger than 1 can lead to exploding gradients and weights smaller than 1 can lead to vanishing gradients, as depicted in the video. I hope this resolves your first query.
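To see the product of weights show up in the gradient itself, here is a minimal 1-D sketch (again my own illustration, with an MSE-style loss) of a deep network with linear activations; the gradient with respect to the very first weight contains the product of all the other weights:

```python
import numpy as np

# 1-D deep linear network: a = w_L * ... * w_2 * w_1 * x, loss = 0.5 * (a - y)^2.
def grad_first_weight(weights, x, y):
    a = x
    for w in weights:
        a = w * a                              # linear activation g(z) = z
    # dL/dw_1 = (a - y) * x * (w_2 * ... * w_L): a product of the other weights
    return (a - y) * x * np.prod(weights[1:])

x, y = 1.0, 0.0
print(grad_first_weight(np.full(50, 1.5), x, y))   # astronomically large -> explodes
print(grad_first_weight(np.full(50, 0.5), x, y))   # practically zero     -> vanishes
```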

As to your second query, let me present two counter-examples. First, think about what happens with a sigmoid/tanh activation function and extremely large (positive or negative) inputs: the activation saturates, its derivative is nearly zero, and you run into vanishing gradients. Second, think about what happens with a ReLU activation function and only extremely large positive inputs: since ReLU is the identity for positive inputs, the gradients take the same product-of-weights form as above, which could lead to exploding gradients. I hope this resolves your second query.
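And here is a quick numeric check of both counter-examples (again my own sketch): the sigmoid/tanh derivatives collapse towards 0 for large-magnitude inputs, while the ReLU derivative stays exactly 1 for positive inputs, so it passes the full product of weights through during backprop:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-20.0, 0.0, 20.0])

# Sigmoid and tanh saturate: their derivatives are ~0 for large |z|,
# so gradients flowing through them vanish.
print(sigmoid(z) * (1.0 - sigmoid(z)))   # approx [2e-09, 0.25, 2e-09]
print(1.0 - np.tanh(z) ** 2)             # approx [0.0,   1.0,  0.0 ]

# ReLU derivative: exactly 1 for positive inputs (0 otherwise),
# so large positive inputs do nothing to shrink the product of weights.
print(np.where(z > 0, 1.0, 0.0))         # [0., 0., 1.]
```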

Cheers,
Elemento
