# So, what is vanishing/exploding gradient?

I understood this slide very poorly.

If I understand correctly: if we initialize W to the same value in every layer, and that value is less than 1 or much greater than 1, then the influence of W will exponentially shrink or grow the outputs of the final layers.

But how is this connected with the gradient? Why is it called vanishing/exploding?

“Vanishing gradients” means that the gradients trend toward zero before the cost has been minimized, so the weights are no longer being significantly modified, and the cost becomes stuck.

“Exploding gradients” means that the gradients become increasingly large, which increases the weights, and the cost tends toward +Inf or -Inf.
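The exponential effect behind both cases can be sketched numerically. Here is a minimal toy model (my own illustration, not from the course): a deep "network" in which every layer simply multiplies its input by the same scalar weight w, so the output after 100 layers is w**100:

```python
# Toy illustration (assumed setup): 100 identical linear layers,
# each multiplying its input by the same scalar weight w.
def forward(w, depth=100, x=1.0):
    for _ in range(depth):
        x = w * x
    return x

print(forward(0.9))  # 0.9**100 ~ 2.7e-5 -> signal shrinks toward zero
print(forward(1.1))  # 1.1**100 ~ 1.4e4  -> signal blows up
print(forward(1.0))  # exactly 1.0       -> stable
```

The same repeated multiplication shows up in backprop, where the gradient is a product of one factor per layer, which is why a factor slightly below or above 1 makes the gradient vanish or explode as depth grows.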


First, understand what the gradient is. The gradient is the derivative of the loss with respect to the parameters (W, b), and also with respect to the intermediate values (A, Z), so it depends on those values. If the gradient updates W by a very large amount, the cost will just bounce back and forth and never reach the minimum: that is an exploding gradient. If the gradient updates W by a very small amount (approaching zero), the cost will find it very hard to reach the minimum, because the change at every iteration is tiny. The cost will be stuck: that is a vanishing gradient.
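The "stuck" and "bounce back and forth" behaviors above can be sketched with gradient descent on a toy cost. This is my own hedged illustration: the quadratic cost w**2 (minimum at w = 0) and the grad_scale factor are assumptions, with grad_scale standing in for how a deep network can shrink or inflate the true gradient:

```python
# Gradient descent on cost(w) = w**2, whose minimum is at w = 0.
# grad_scale mimics a vanishing (tiny) or exploding (huge) gradient.
def train(w, grad_scale, lr=0.1, steps=50):
    for _ in range(steps):
        grad = 2 * w * grad_scale  # scaled derivative of w**2
        w = w - lr * grad
    return w

print(train(1.0, grad_scale=1.0))    # converges toward the minimum at 0
print(train(1.0, grad_scale=1e-6))   # barely moves: the cost is "stuck"
print(train(1.0, grad_scale=100.0))  # overshoots on every step: diverges
```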

Best,
Saif.


This thread might be worth a look, @someone555777: Vanishing/Exploding gradients C2W1 - #2 by Christian_Simonis

Hope that helps!

Best regards
Christian


So, do I understand correctly that vanishing/exploding gradients are only possible in a deep neural network, for example one with more than 100 layers, where all the W in those 100 layers are less than or greater than 1? In that case they influence each other, and the calculations in the last layers get multiplied by an exponentially small or large number?

So, if a NN has 100 neurons, for example, and 50 of them have W = 0.5 and the remaining 50 have W = 1.5, would everything be more or less fine, without the vanishing/exploding gradient problem?

It’s more complex than that. Besides the architecture and the initialisation of the weights, other factors such as the hyperparameters and the activation function also play a role in ensuring effective training (with stable gradients); see also this thread: Activation functions - #2 by Christian_Simonis
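To make the activation-function point concrete, here is a small sketch (my own illustration, not from the linked thread): the derivative of the sigmoid is at most 0.25, so backprop through many sigmoid layers multiplies many factors ≤ 0.25 and the gradient shrinks, whereas ReLU's derivative is 1 wherever the unit is active:

```python
import math

def sigmoid_deriv(z):
    # Derivative of sigmoid: s * (1 - s), maximal at z = 0 where it equals 0.25.
    s = 1 / (1 + math.exp(-z))
    return s * (1 - s)

depth = 30
print(sigmoid_deriv(0.0) ** depth)  # 0.25**30 ~ 8.7e-19: effectively vanished
print(1.0 ** depth)                 # ReLU's derivative in its active region: stays 1.0
```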

Did you watch Andrew’s video? https://youtu.be/qhXZsFVxGKo

It’s really good and illustrative, explaining what can happen with the gradient during training and backprop. I would recommend watching it - it’s really worth it!

Best regards
Christian

In addition to following all the links from Christian, I think there is a fundamental misunderstanding you have expressed there. We are talking about the gradients of W, not the values of W, right? The values of W may be close to zero or not; the question is whether those values give good predictions. If they don’t, then we need the training to push them in the direction of better predictions, right? That’s the point. And it is the gradients that do the “pushing”.

If the gradients are vanishing (zero or close to it), the training can’t move towards a better result. If the gradients are exploding, the training doesn’t converge at all and pushes the W values to be even worse in terms of the resulting predictions. That is why vanishing or exploding gradients are a problem and we need a solution. The solution can be adjusting hyperparameters or changing the architecture of the network (e.g. by using a Residual Net instead of a simple convolutional net).
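To illustrate the Residual Net remark at the end, here is a rough sketch with assumed toy numbers (my own illustration). In backprop the gradient is a product of the layers' local derivatives; a residual layer computes x + f(x), so its local derivative is 1 + f'(x), and that identity term keeps the product from collapsing to zero:

```python
depth = 50
f_prime = 0.1  # assumed local derivative of each layer's learned branch

# Plain stack: product of 50 small factors vanishes.
plain_gradient = f_prime ** depth

# Residual stack: each factor is 1 + f'(x), so the "+1" from the
# skip connection keeps the product well away from zero.
residual_gradient = (1 + f_prime) ** depth

print(plain_gradient)     # 1e-50: vanished
print(residual_gradient)  # far from zero
```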