# So, what is vanishing/exploding gradient?

I understood this slide very poorly.

If I understand correctly: if we initialize W to the same value in every layer, and that value is less than 1 or much greater than 1, then the influence of W will exponentially shrink or grow the outputs of the final layers.

But how is this connected with the gradient? Why is it called vanishing/exploding?

“Vanishing gradients” means that the gradients trend toward zero before the cost has been minimized, so the weights are no longer being significantly modified, and the cost becomes stuck.

“Exploding gradients” means that the gradients become increasingly large, which increases the weights, and the cost tends toward +Inf or -Inf.
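The exponential effect behind both cases can be sketched numerically. Here is a minimal toy model (my own illustration, not from the course): a deep "network" in which every layer simply multiplies its input by the same scalar weight w, so the output after 100 layers is w**100:

```python
# Toy illustration (assumed setup): 100 identical linear layers,
# each multiplying its input by the same scalar weight w.
def forward(w, depth=100, x=1.0):
    for _ in range(depth):
        x = w * x
    return x

print(forward(0.9))  # 0.9**100 ~ 2.7e-5 -> signal shrinks toward zero
print(forward(1.1))  # 1.1**100 ~ 1.4e4  -> signal blows up
print(forward(1.0))  # exactly 1.0       -> stable
```

The same repeated multiplication shows up in backprop, where the gradient is a product of one factor per layer, which is why a factor slightly below or above 1 makes the gradient vanish or explode as depth grows.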


First, understand what the gradient is. The gradient is the derivative of the loss with respect to the parameters (W, b), and also with respect to the intermediate values (A, Z), so it depends on those values. If the gradient updates W by a very large amount, the cost will just bounce back and forth and never reach the minimum: that is an exploding gradient. If the gradient updates W by a very small amount (approaching zero), the cost will find it very hard to reach the minimum, because the change at every iteration is tiny. The cost will be stuck: that is a vanishing gradient.
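The "stuck" and "bounce back and forth" behaviors above can be sketched with gradient descent on a toy cost. This is my own hedged illustration: the quadratic cost w**2 (minimum at w = 0) and the grad_scale factor are assumptions, with grad_scale standing in for how a deep network can shrink or inflate the true gradient:

```python
# Gradient descent on cost(w) = w**2, whose minimum is at w = 0.
# grad_scale mimics a vanishing (tiny) or exploding (huge) gradient.
def train(w, grad_scale, lr=0.1, steps=50):
    for _ in range(steps):
        grad = 2 * w * grad_scale  # scaled derivative of w**2
        w = w - lr * grad
    return w

print(train(1.0, grad_scale=1.0))    # converges toward the minimum at 0
print(train(1.0, grad_scale=1e-6))   # barely moves: the cost is "stuck"
print(train(1.0, grad_scale=100.0))  # overshoots on every step: diverges
```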

Best,
Saif.


This thread might be worth a look, @someone555777: Vanishing/Exploding gradients C2W1 - #2 by Christian_Simonis

Hope that helps!

Best regards
Christian


So, do I understand correctly that vanishing/exploding gradients are only possible in a deep neural network, for example one with more than 100 layers, where all the W in those 100 layers are less than or greater than 1? In that case they influence each other, and the calculations in the last layers get multiplied by an exponentially small or large number?

So, if a NN has 100 neurons, for example, and 50 of them have W = 0.5 and the remaining 50 have W = 1.5, would everything be more or less fine, without the vanishing/exploding gradient problem?

It’s more complex than that. Besides the architecture and the initialisation of the weights, other factors such as the hyperparameters and the activation function also play a role in ensuring effective training (with stable gradients); see also this thread: Activation functions - #2 by Christian_Simonis
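To make the activation-function point concrete, here is a small sketch (my own illustration, not from the linked thread): the derivative of the sigmoid is at most 0.25, so backprop through many sigmoid layers multiplies many factors ≤ 0.25 and the gradient shrinks, whereas ReLU's derivative is 1 wherever the unit is active:

```python
import math

def sigmoid_deriv(z):
    # Derivative of sigmoid: s * (1 - s), maximal at z = 0 where it equals 0.25.
    s = 1 / (1 + math.exp(-z))
    return s * (1 - s)

depth = 30
print(sigmoid_deriv(0.0) ** depth)  # 0.25**30 ~ 8.7e-19: effectively vanished
print(1.0 ** depth)                 # ReLU's derivative in its active region: stays 1.0
```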

Did you watch Andrew’s video? https://youtu.be/qhXZsFVxGKo

It’s really good and illustrative, explaining what can happen with the gradient during training and backprop. I would recommend watching it - it’s really worth it!

Best regards
Christian

In addition to following all the links from Christian, I think there is a fundamental misunderstanding you have expressed there. We are talking about the gradients of W, not the values of W, right? The values of W may be close to zero or not; the question is whether those values give good predictions. If they don’t, then we need the training to push them in the direction of better predictions, right? That’s the point. And it is the gradients that do the “pushing”.

If the gradients are vanishing (zero or close to it), the training can’t move towards a better result. If the gradients are exploding, the training doesn’t converge at all and pushes the W values to be even worse in terms of the resulting predictions. That is why vanishing or exploding gradients are a problem and we need a solution. The solution can be adjusting hyperparameters or changing the architecture of the network (e.g. by using a Residual Net instead of a simple convolutional net).
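To illustrate the Residual Net remark at the end, here is a rough sketch with assumed toy numbers (my own illustration). In backprop the gradient is a product of the layers' local derivatives; a residual layer computes x + f(x), so its local derivative is 1 + f'(x), and that identity term keeps the product from collapsing to zero:

```python
depth = 50
f_prime = 0.1  # assumed local derivative of each layer's learned branch

# Plain stack: product of 50 small factors vanishes.
plain_gradient = f_prime ** depth

# Residual stack: each factor is 1 + f'(x), so the "+1" from the
# skip connection keeps the product well away from zero.
residual_gradient = (1 + f_prime) ** depth

print(plain_gradient)     # 1e-50: vanished
print(residual_gradient)  # far from zero
```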