W1 assignment_Initialization

I understand that we use He initialization to take care of vanishing or exploding gradients.
But I think that to take care of exploding gradients I can also just multiply the weights by 0.01. Then the problem of exploding gradients is solved, but the vanishing gradient problem still remains.
Can we take care of that by increasing the number of iterations?
I know this adds a lot of computational cost, but if we neglect that, can this be a solution to vanishing gradients?

This is what is happening

I also understand that vanishing gradients cause gradient descent to be slow.
Tell me if my understanding of exploding gradients is correct:
As we update the parameters in gradient descent, if the gradients explode then the value of w fluctuates wildly because the gradient is too big.

@Amit_Shukla Can you help me with this??

Here are some thoughts (though I have not reviewed this course recently):

  • “He initialization” isn’t specifically about vanishing gradients. It’s a method for making ReLU units work more effectively, because ReLU units have zero gradient for all negative inputs.

  • Exploding gradients have two causes: feature values with high variance (for which the fix is normalizing the features), or a learning rate that is too large.
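
To make the comparison from the question concrete, here is a minimal NumPy sketch (not the assignment's code) that pushes random inputs through a stack of ReLU layers twice: once with weights scaled by 0.01 and once with the He scale sqrt(2 / n_prev). The layer sizes, batch size, seed, and the helper name `forward_activation_scale` are all assumptions made for illustration. It only measures forward activations, used here as a rough proxy, since backpropagated gradients pass through the same weight matrices.

```python
import numpy as np

def forward_activation_scale(scale_fn, layer_dims, seed=0):
    """Return the mean |activation| of the last layer of a random ReLU stack.

    scale_fn maps the fan-in n_prev to the factor applied to N(0, 1) weights.
    """
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((layer_dims[0], 256))  # 256 random input columns
    for n_prev, n_curr in zip(layer_dims[:-1], layer_dims[1:]):
        W = rng.standard_normal((n_curr, n_prev)) * scale_fn(n_prev)
        a = np.maximum(0.0, W @ a)  # ReLU, biases left at zero
    return float(np.mean(np.abs(a)))

# Hypothetical architecture: 10 layers of 100 units each.
layer_dims = [100] * 11

small_scale = forward_activation_scale(lambda n_prev: 0.01, layer_dims)
he_scale = forward_activation_scale(lambda n_prev: np.sqrt(2.0 / n_prev), layer_dims)

print("mean |a| with 0.01 scaling:", small_scale)
print("mean |a| with He scaling:  ", he_scale)
```

Under these assumptions the 0.01 scaling shrinks the signal by a roughly constant factor at every layer, so it collapses toward zero with depth, while the He scale keeps it at roughly the same order of magnitude as the input.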

Hi @Kamal_Nayan, thanks for your question.

The problems of vanishing and exploding gradients are caused by your learning rate. Increasing the number of iterations will not help if your learning rate is either too small or too big. Let’s try to understand each of them separately.

Vanishing gradients generally occur when your model settles at some local optimum instead of reaching the global optimum, because the parameter updates are very small. If you have some basic knowledge of calculus, you will know that near an optimum the derivative of the function approaches 0. If your learning rate is also very small, it compounds the effect and makes the updates even smaller, to the point of being almost negligible. In that case your parameters get stuck in the same place because they are barely changing, so increasing the number of training iterations will not help at all.

Exploding gradients occur when your learning rate is too high. In such cases the updates are far too large, cause radical changes in the parameter values, and can frequently shift the course of training. As we approach an optimum, the parameter updates should slow down, but with a high learning rate we do not have any brakes at all! This results in large swings in the parameter values and leads to divergence, so increasing the number of iterations only makes things worse by adding computational cost.
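
The two learning-rate cases above can be sketched on a toy problem. This is only an illustration, assuming a one-parameter loss f(w) = w^2 (gradient 2w) rather than a real network; the function name `gradient_descent`, the starting point, the step count, and the three learning rates are all hypothetical choices.

```python
def gradient_descent(learning_rate, steps=50, w0=1.0):
    """Run plain gradient descent on f(w) = w**2 and return the final w."""
    w = w0
    for _ in range(steps):
        grad = 2.0 * w                 # derivative of w**2
        w = w - learning_rate * grad   # standard update rule
    return w

w_small = gradient_descent(0.0001)  # each step multiplies w by 0.9998
w_good = gradient_descent(0.1)      # each step multiplies w by 0.8
w_large = gradient_descent(1.5)     # each step multiplies w by -2

print("lr=0.0001:", w_small)
print("lr=0.1:   ", w_good)
print("lr=1.5:   ", w_large)
```

With the tiny learning rate w barely leaves its starting value after 50 steps, with the moderate one it approaches the minimum at 0, and with the large one it flips sign and doubles in magnitude at every step, which is the swinging and divergence described above.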

Thus it’s important to understand that choosing your learning rate, and normalising and scaling your data where required, are very important. Generally recommended values for the learning rate lie in the range of 0.01 to 1, but this depends on your problem statement. I hope I have answered your question.
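
For the normalisation and scaling step, a common choice is to standardize each feature to zero mean and unit variance. The sketch below assumes that choice; the helper name `standardize`, the `eps` guard, and the sample matrix are made up for illustration.

```python
import numpy as np

def standardize(X, eps=1e-8):
    """Scale each column (feature) of X to zero mean and unit variance."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    return (X - mean) / (std + eps)  # eps avoids division by zero

# Hypothetical data: rows are examples, columns are features on very different scales.
X = np.array([[1.0, 1000.0],
              [2.0, 3000.0],
              [3.0, 5000.0]])

X_norm = standardize(X)
print(X_norm)
```

In practice the mean and standard deviation would be computed on the training set only and then reused for validation and test data.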