Vanishing / exploding gradients

Hello
When Prof. Andrew explained the concept of vanishing/exploding gradients, he assumed that we're using a linear activation function g(z) = z, and thus we end up with very small/large activation values, because we have a^(l) = W^(l) W^(l-1) … W^(2) W^(1) X, and therefore the weights become too small or too large because they have to adjust to very small or large activation values. First, I want to know whether what I said is true or not. Second, I think (assuming what I said is true) that if we don't use a linear activation function, we won't have this problem, because we don't end up multiplying the weights like this: a^(l) = W^(l) W^(l-1) … W^(2) W^(1) X.

Hey @abdou_brk,
You're right that Prof. Andrew uses a linear activation function there. The rest of your argument seems a bit shaky to me, but never mind, let's try to see how we can state it better. Here, I am assuming we are referring to the lecture video entitled “Vanishing / Exploding Gradients”.

First of all, our area of interest here is “very small/large gradients” and not “very small/large activation values”. Second, in the video, Prof. Andrew doesn’t want the learners to focus on how the weights are calculated; instead, he wants them to focus on what happens when the weights are larger/smaller than 1 in this particular example, and to carry those insights over to other examples.
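To make the “larger/smaller than 1” intuition concrete, here is a tiny NumPy sketch (my own illustration, not course code) where every layer uses the same scaled-identity weight matrix and a linear activation, roughly in the spirit of the simplified setup in the video:

```python
import numpy as np

# Toy deep network: every layer uses W^(l) = scale * I and g(z) = z,
# so a^(l) = W^(l) W^(l-1) ... W^(1) x = scale^L * x.
def forward_linear(x, scale, num_layers):
    W = scale * np.eye(len(x))
    a = x
    for _ in range(num_layers):
        a = W @ a
    return a

x = np.array([1.0, 1.0])
print(forward_linear(x, 1.5, 50))   # ~6.4e8 per entry  -> explodes
print(forward_linear(x, 0.5, 50))   # ~8.9e-16 per entry -> vanishes
```

The same “multiply by something bigger/smaller than 1, fifty times” effect shows up in the gradients, which is the real concern.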

The part about the activations being a product of the weight matrices, a^(l) = W^(l) W^(l-1) … W^(2) W^(1) X, is correct, and if we assume the loss function to be, say, MSE, you will find that the gradients also contain a product of the weights. Hence, weights larger than 1 can lead to exploding gradients and weights smaller than 1 can lead to vanishing gradients, as depicted in the video. I hope this resolves your first query.
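To see the product of weights show up in the gradient itself, here is a minimal 1-D sketch (again my own illustration, with an MSE-style loss) of a deep network with linear activations; the gradient with respect to the very first weight contains the product of all the other weights:

```python
import numpy as np

# 1-D deep linear network: a = w_L * ... * w_2 * w_1 * x, loss = 0.5 * (a - y)^2.
def grad_first_weight(weights, x, y):
    a = x
    for w in weights:
        a = w * a                              # linear activation g(z) = z
    # dL/dw_1 = (a - y) * x * (w_2 * ... * w_L): a product of the other weights
    return (a - y) * x * np.prod(weights[1:])

x, y = 1.0, 0.0
print(grad_first_weight(np.full(50, 1.5), x, y))   # astronomically large -> explodes
print(grad_first_weight(np.full(50, 0.5), x, y))   # practically zero     -> vanishes
```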

As to your second query, let me present two counter-examples. First, think about what happens with a sigmoid/tanh activation function and extremely large (positive or negative) inputs: the activation saturates, its derivative is nearly zero, and you run into vanishing gradients. Second, think about what happens with a ReLU activation function and only extremely large positive inputs: since ReLU is the identity for positive inputs, the gradients take the same product-of-weights form as above, which could lead to exploding gradients. I hope this resolves your second query.
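And here is a quick numeric check of both counter-examples (again my own sketch): the sigmoid/tanh derivatives collapse towards 0 for large-magnitude inputs, while the ReLU derivative stays exactly 1 for positive inputs, so it passes the full product of weights through during backprop:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-20.0, 0.0, 20.0])

# Sigmoid and tanh saturate: their derivatives are ~0 for large |z|,
# so gradients flowing through them vanish.
print(sigmoid(z) * (1.0 - sigmoid(z)))   # approx [2e-09, 0.25, 2e-09]
print(1.0 - np.tanh(z) ** 2)             # approx [0.0,   1.0,  0.0 ]

# ReLU derivative: exactly 1 for positive inputs (0 otherwise),
# so large positive inputs do nothing to shrink the product of weights.
print(np.where(z > 0, 1.0, 0.0))         # [0., 0., 1.]
```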

Cheers,
Elemento
