From the course and this post, I get the message that there is a problem of exploding (vanishing) gradients if the weights are initialized too large (small). To demonstrate this, Andrew gave an example of a neural network with many layers, all having linear activation functions. Suppose every weight matrix is initialized to be larger (smaller) than the identity matrix, say W = k·I; then the prediction is:
ŷ = W[L] · k^(L-1) · x
where k is a number larger (smaller) than one.
This way, if W[0] or any other W[t] with small t changes, ŷ changes by a lot (by nearly nothing), meaning the gradient is too large (nearly zero), leading to oscillation (early convergence before the cost reaches its minimum).
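For concreteness, here is a tiny NumPy sketch of that scaling (the depth of 50 and the values of k below are just illustrative choices, not numbers from the course):

```python
import numpy as np

x = np.ones(3)           # toy input
L = 50                   # number of layers (illustrative)

for k in (1.5, 0.5):     # k > 1 explodes, k < 1 vanishes
    W = k * np.eye(3)    # every hidden layer initialized as W = k * I
    h = x
    for _ in range(L - 1):
        h = W @ h        # linear activations, so h scales like k**(L-1)
    print(f"k={k}: |y_hat| ~ {np.linalg.norm(h):.2e}   (k**(L-1) = {k ** (L - 1):.2e})")
```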
My question is, wouldn't this problem automatically vanish after a few iterations? The gradients are only large (nearly zero) for W[t] with small t, but they should be fine for W[t] with large t. This way the W of the deeper layers would still get updated correctly. Since they are updated correctly, the relationship ŷ = W[L] · k^(L-1) · x would no longer hold.
More generally, my question is this: initializing the weights too large (small) would make the gradients of the shallow layers improperly large (small), but it shouldn't affect the gradients of the deep ones. So the weights of the deep layers would get adjusted to more proper values after a few iterations, which means they would no longer all be too large (small). The chain that leads to improperly large (small) gradients for the shallow layers would be broken. Therefore, shouldn't the problem of exploding/vanishing gradients be solved by the training process itself?
Thank you for the question. It touches on a quite interesting and important topic in machine learning. I will try to explain it.
Exploding and vanishing gradients are not related to the training steps; they depend on the computational graph (i.e., the model itself). Feedforward networks with many layers have deep computational graphs. Each layer applies an operation that usually involves a weight multiplication.
For example, many layers apply the affine transformation wx + b, where w is the weight matrix of some layer and x is the output of the previous layer. Just for the sake of explanation, let us omit the bias term b. We may now see a forward pass through the computational graph as a series of weight multiplications:

z = w_n · w_(n-1) · … · w_1 · x
To put it simply (a rigorous explanation would involve the eigendecomposition of the weight matrix), repeatedly multiplying by weights with magnitude less than one will decay the signal to zero (shrink it), and repeatedly multiplying by weights with magnitude greater than one will make it explode.
The same goes for backward propagation, which involves repeated multiplication of the gradients.
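A minimal numerical sketch of that idea (the Gaussian initializer, the width of 64, and the depth of 80 are my own illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n, depth = 64, 80

for scale in (0.05, 0.2):                  # std of the random initializer (illustrative)
    h = rng.standard_normal(n)             # toy input
    grad = np.ones(n)                      # stand-in for an upstream gradient
    for _ in range(depth):
        W = scale * rng.standard_normal((n, n))
        h = W @ h                          # forward pass: one more weight multiplication
        # backward propagation multiplies by the transposed weights (in reverse order
        # in a real network, but the scaling behaves the same way):
        grad = W.T @ grad
    print(f"scale={scale}: |h| = {np.linalg.norm(h):.2e}, |grad| = {np.linalg.norm(grad):.2e}")
```

With the small scale both quantities shrink towards zero, and with the large scale both blow up.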
Vanishing gradients are one of the biggest optimization challenges in machine learning. For some architectures, such as recurrent networks (RNNs), which construct very deep computational graphs by repeatedly applying the same operation at each time step, it raises especially pronounced difficulties.
Proper weight initialization is just one way to mitigate the problem. There are other techniques, such as batch normalization and skip connections, that address vanishing gradients. Gradient clipping is a popular technique to address exploding gradients.
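As a small illustration of the last point, gradient clipping by norm can be sketched like this (the threshold of 1.0 is an illustrative choice):

```python
import numpy as np

def clip_by_norm(grad, max_norm=1.0):
    """Rescale grad so that its L2 norm never exceeds max_norm."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad

# usage: an exploded gradient keeps its direction but gets a manageable size
g = np.array([3e8, -4e8])
print(clip_by_norm(g))   # [ 0.6 -0.8]
```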
Thank you for your answer. However, I don’t feel like my question is resolved yet, and I would appreciate further clarification.
You mentioned:

"repeatedly multiplying by weights with magnitude less than one will decay the signal to zero (shrink it), and repeatedly multiplying by weights with magnitude greater than one will make it explode"
My question is, why would all the weights still have magnitude less than or greater than one after a few iterations? Although you mentioned that the exploding and vanishing gradient problems are related to the computational graph and not to the training steps, my confusion is why the training steps don't automatically solve the problem. The reason I feel the training process will solve it is the following:
During training, the deeper layers will have reasonable gradients (there are not enough layers for the multiplication to make the gradients unreasonably large/small), so they would be updated reasonably.
After these deep layers get updated to reasonable values (not all larger or smaller than 1), in the next iteration the next deep-but-not-quite-so-deep layers will have reasonable gradients as well.
Eventually, even the shallower layers should have reasonable gradients if the deeper layers are not all larger or smaller than 1. So the vanishing and exploding gradients would be resolved.
The gradients are only large (nearly zero) for W[t] with small t, but they should be fine for W[t] with large t. This way the W of the deeper layers would still get updated correctly.
It's not clear why you assume that the weights of the deep layers should be fine.
My question is, wouldn’t this problem automatically vanish after a few iterations?
Consider just a single training iteration. Each layer has weights. In the forward pass, the weights of the last layer may be seen as a product of the weights of all the previous layers. In backward propagation, the gradients of the first layer may be seen as a product of the gradients of all the following layers. The product of many small values is a very small value (vanishing), and the product of many large values is an extremely large value (exploding).
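As a rough numerical sketch of the backward part of that statement (the width, the depth, and the 0.1 initializer scale are illustrative; this tracks only the gradient flowing backwards through the layers, not the weight updates themselves):

```python
import numpy as np

rng = np.random.default_rng(0)
n, L = 32, 50
Ws = [0.1 * rng.standard_normal((n, n)) for _ in range(L)]   # "too small" init (illustrative)

# Pretend the gradient arriving at the last layer has a reasonable size,
# then push it backwards through the layers one at a time.
delta = rng.standard_normal(n)           # stand-in for the gradient at layer L
norms = {}
for l in range(L, 0, -1):
    norms[l] = np.linalg.norm(delta)
    delta = Ws[l - 1].T @ delta          # one more factor in the product

for l in (L, L // 2, 1):
    print(f"|gradient at layer {l}| ~ {norms[l]:.2e}")   # fine near the output, vanishing near the input
```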
Vanishing gradients are a common problem for RNNs because the same weight matrix is applied at each time step.
You can also check out Section 8.2.5 of Goodfellow's Deep Learning book, which has a nice explanation of the intuition behind vanishing gradients.
Or, more generally, the gradients of the l-th layer could be seen as the product of the gradients from the (l+1)-th to the L-th layer. So for a large l (say l = L-2), there wouldn't be a problem of exploding gradients (the gradient of the (L-2)-th layer is just the product of the gradients of 2 layers). Since these deep layers have reasonable gradients, they would get updated in a reasonable way. After a few iterations, they should be updated to some reasonable values, which means they won't all be larger or smaller than 1.
But forward propagation doesn't affect the weights of the layers. The thing that is exploding in forward propagation is y (y = w_0 · w_1 · … · w_n · x ~ w^n · x), not w (w_n is not the same as w^n).
Therefore, the w's of the deep layers will not explode just because they are deep. They would keep the values they were initialized with, and then get updated by the gradients in backpropagation in a reasonable way.
My point is, if z = wx were exploding, there would be neither any next step nor any computation of gradients. Exploding means the learning stops with a failure.
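A tiny sketch of that failure mode (the weight scale and depth are illustrative; in float32 the activations simply overflow, and nothing downstream is usable):

```python
import numpy as np

h = np.ones(3, dtype=np.float32)
W = 4.0 * np.eye(3, dtype=np.float32)    # deliberately "too large" weights (illustrative)

for layer in range(200):
    h = W @ h
    if not np.all(np.isfinite(h)):
        print(f"forward pass overflowed to inf at layer {layer}; "
              "no useful gradients can be computed from here")
        break
```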
Okay, I think I finally understand. So what you mean is that for deep networks with poor initialization, we won't even be able to get to the backward propagation step, because z has already exploded/vanished in the forward propagation. Since backward propagation comes after forward propagation, we can't update any weights, whether for the shallow or the deep layers.
I read it, but your answer above was clearer for me.
In Andrew's example, since the activation functions were linear, the weights get multiplied many times and therefore we see these vanishing/exploding problems. But if the activation function is, for example, tanh, then its output is between -1 and 1. So how can this explode when at every layer the values fall back into this range again and again? tanh' is also between 0 and 1.