W1 assignment_Initialization

I understand that we are using He initialization to take care of vanishing or exploding gradients.
But I think that to take care of exploding gradients I could just multiply the weights by 0.01 instead; then the problem of exploding gradients is solved, but vanishing still remains.
Can we take care of that by increasing the number of iterations?
I know this adds a lot of computational cost, but if we neglect that, can this be a solution to vanishing gradients?
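
For example, here is roughly what I am picturing, as a rough NumPy sketch (the depth, layer sizes, and random seed are made-up numbers just for illustration). It tracks how activation magnitudes behave through a deep ReLU stack with `* 0.01` scaling versus He scaling:

```python
import numpy as np

np.random.seed(0)
layer_sizes = [500] * 11                  # made-up: input plus 10 layers of 500 units
x0 = np.random.randn(layer_sizes[0], 1)

for name, use_he in [("* 0.01", False), ("He", True)]:
    x = x0.copy()
    for n_prev, n in zip(layer_sizes[:-1], layer_sizes[1:]):
        if use_he:
            W = np.random.randn(n, n_prev) * np.sqrt(2.0 / n_prev)  # He scaling
        else:
            W = np.random.randn(n, n_prev) * 0.01                   # constant small scaling
        x = np.maximum(0, W @ x)                                     # ReLU forward pass
    print(name, "mean |activation| after 10 ReLU layers:", np.abs(x).mean())
```

With the constant `* 0.01` scaling the activations shrink towards zero layer after layer, while He scaling keeps them at roughly the same size; this is what I mean by saying 0.01 avoids exploding but vanishing still remains.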


[attached images]
This is what is happening.

I also understand that vanishing gradients give us a problem where gradient descent becomes slow.
Tell me if my understanding of exploding gradients is correct:
As we update parameters in gradient descent, if we have an exploding gradient problem then the value of w just fluctuates very fast, because our gradient is too big.
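
For example (a toy illustration of the update rule `w = w - learning_rate * dw`, with completely made-up numbers):

```python
# made-up numbers: one parameter, huge gradients, a modest learning rate
w, learning_rate = 1.0, 0.1
for dw in [1000.0, -1200.0, 1500.0]:      # exploding gradients that flip sign
    w = w - learning_rate * dw
    print(w)                              # w jumps to -99.0, then 21.0, then -129.0
```

So w swings wildly instead of moving steadily towards a minimum. Is this roughly the right picture?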

@Amit_Shukla Can you help me with this??

Here are some thoughts (though I have not reviewed this course recently):

  • “He initialization” isn’t specifically about vanishing gradients. It’s a method for making ReLU units work more effectively, because ReLU units have zero gradients for all negative inputs.

  • Exploding gradients have two common causes: feature values with high variance (for which the fix is normalizing the features; see the quick sketch below), or using a learning rate that is too large.
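
A quick sketch of what I mean by normalizing the features (the feature values below are made up); standardizing each column to zero mean and unit variance keeps the gradients on a comparable scale:

```python
import numpy as np

# made-up raw features on very different scales (e.g. square footage vs. number of rooms)
X = np.array([[2100.0, 3.0],
              [1600.0, 2.0],
              [2400.0, 4.0],
              [1400.0, 2.0]])

mu = X.mean(axis=0)
sigma = X.std(axis=0)
X_norm = (X - mu) / sigma        # each feature now has zero mean and unit variance

print(X_norm.mean(axis=0))       # approximately [0, 0]
print(X_norm.std(axis=0))        # [1, 1]
```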

Hi @Kamal_Nayan, thanks for your question.

The problems of vanishing and exploding gradients are closely tied to your learning rate. Increasing the number of iterations will not help if your learning rate is either too small or too big. Let’s try to understand both of them separately.

Vanishing gradients generally show up when your model settles at some local optimum instead of reaching the global optimum, because the updates to the parameters are very small. If you have some basic knowledge of calculus, you know that near an optimum the derivative of the function goes close to 0. If your learning rate is also very small, it amplifies this effect and makes the updates even smaller, almost negligible. In that case your parameters get stuck in the same place because they are hardly changing at all, so increasing the number of training iterations would not help.
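
As a toy illustration (all numbers made up, and the gradient is held fixed near zero for simplicity): with a near-zero gradient and a small learning rate, even thousands of extra iterations barely move the parameter.

```python
w, learning_rate, dw = 2.0, 0.001, 1e-5   # near an optimum the gradient dw is almost 0
for _ in range(10_000):                   # 10,000 extra iterations of training
    w = w - learning_rate * dw
print(w)                                  # roughly 1.9999, i.e. w has barely moved at all
```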

Exploding gradients occur when your learning rate is too high. In that case the updates are far too large, causing radical changes in the parameter values and frequent shifts in the behaviour of training. As we approach an optimum, the speed at which we update the parameters should slow down, but with a high learning rate we have no brakes at all! That results in big swings in parameter values that lead to divergence, so again, increasing the number of iterations only makes things worse while adding computational cost.
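
And a minimal sketch of that divergence, using a made-up quadratic cost J(w) = w^2 (so dJ/dw = 2w): with a learning rate that is too large, every step overshoots the minimum at w = 0 and the swings keep growing.

```python
def grad(w):
    return 2 * w                  # gradient of the toy cost J(w) = w**2

w, learning_rate = 1.0, 1.1       # this learning rate is too large for this cost
for step in range(5):
    w = w - learning_rate * grad(w)
    print(step, w)                # w swings: roughly -1.2, 1.44, -1.73, 2.07, -2.49 (diverging)
```

A smaller learning rate (say 0.1) on the same cost converges smoothly towards 0 instead.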

Thus it’s important to understand that choosing your learning rate carefully, and normalising and scaling your data where required, really matter. Generally recommended learning rates lie in the range of 0.01 to 1, but this is subject to your problem statement. I hope I have answered your question.

Regards,
Amit