Vanishing gradients with RNNs

Here is what Andrew said in the video about Vanishing Gradients with RNNs:
An RNN, say an RNN processing data over 1,000 time steps, or over 10,000 time steps, that’s basically a 1,000 layer or like a 10,000 layer neural network.

I am not sure I understand his argument, since per his previous videos on RNNs there is only one set of parameters for the input (Wax), one set for the previous hidden state (Waa), and one set for the output (Wya). Therefore, calculating the gradients for Wax, Waa, and Wya should not be considered as going through 1,000 or 10,000 layers, but only as a single layer.

Any help, explanation, or pointer to a tutorial would be appreciated.

You have to think carefully about how back propagation works in an RNN. It propagates backwards through the timesteps instead of backwards through the layers, but the net effect is the same: the gradient at an earlier timestep is the product of all the per-step gradients between that timestep and the final one. In both cases you are composing functions: the output of a given timestep is input to the next timestep through the cell state (whether that is the plain vanilla hidden state or the GRU or LSTM states). So you end up multiplying gradients, and when you multiply many numbers with magnitude less than 1, the product gets smaller and smaller, just as in the multi-layer FCN or CNN case.
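To make that concrete, here is a toy numerical illustration (my own made-up numbers, nothing from the course): treat each timestep as contributing one local gradient factor with magnitude below 1, and watch what the product does over many steps.

```python
import numpy as np

# Toy illustration: one "local gradient" factor per timestep, each below 1.
rng = np.random.default_rng(0)
factors = rng.uniform(0.5, 0.9, size=1000)

print(np.prod(factors))        # gradient reaching timestep 1: effectively zero
print(np.prod(factors[-10:]))  # gradient only 10 steps back: still a usable size
```

That is why influence from errors near the end of a long sequence has a hard time reaching the earliest timesteps.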

Of course it’s also more complex than FCNs or CNNs, because depending on the architecture of the RNN there may be two paths for gradients to feed into a given timestep: in some cases a loss is also calculated on the \hat{y}^{<t>} value at that timestep. That may mitigate the problem, since that contribution is not the product of potentially hundreds of factors and is added to the gradient coming through the timestep path. But even then, the gradients from the main path would have little influence on the result, which probably makes it harder to get a good solution through training.
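As a toy numeric sketch of those two paths (numbers invented purely for illustration): the contribution from the loss on \hat{y}^{<t>} is only a few chain-rule factors deep, while the contribution arriving through the hidden-state path can be the product of hundreds of factors.

```python
# Two gradient paths feeding the hidden state at timestep t (toy numbers):
grad_from_yhat_t = 0.3          # short path: loss on y_hat at this timestep
grad_from_later_steps = 1e-40   # long path: product of hundreds of per-step factors

grad_a_t = grad_from_yhat_t + grad_from_later_steps
print(grad_a_t)   # dominated by the short path; the long path contributes almost nothing
```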

@paulinpaloalto this may be too ‘unserious’ of a response for your very serious one, but I ‘sorta’ think of it as playing a game of ‘telephone’.

Just to refresh my memory, I went back and watched the two relevant lectures in C5 W1:

Backpropagation Through Time
Vanishing Gradients with RNNs

You’re right that he doesn’t really explain any details of how the gradients propagate between the timesteps. He just draws the forward and backward arrows and leaves it at that. I’m guessing that’s because he’s trying not to scare people away with too much calculus. The course is, after all, designed not to require any calculus knowledge. So the good news is you don’t need to know calculus, but the bad news is that you just have to take his word for a lot of things.

The one place where you can see this is in the (optional and ungraded) Back Prop section of the first assignment in Week 1. Here’s the relevant diagram:

[diagram from the notebook: the backward pass through a single RNN cell, with the gradient of the cost flowing in on the right and out on the left]

The key point to notice is that the gradient coming in from the right side is \displaystyle \frac {\partial J}{\partial a^{<t>}}, the output on the left is \displaystyle \frac {\partial J}{\partial a^{<t-1>}}, and the latter includes the former as a factor. And of course it’s recursive: the incoming gradient already includes the factors from all the later timesteps.
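Here is a minimal numpy sketch of what that diagram describes for a plain vanilla RNN cell with a^{<t>} = \tanh(W_{aa}a^{<t-1>} + W_{ax}x^{<t>} + b_a); the function name and shapes are my assumptions, not the assignment’s exact code.

```python
import numpy as np

# Backward pass through one plain RNN cell.
# Assumed shapes: a and da are (n_a, m), x is (n_x, m), Waa is (n_a, n_a), Wax is (n_a, n_x).
def rnn_cell_backward_sketch(da_next, a_next, a_prev, xt, Waa, Wax):
    # da_next is dJ/da^<t>, the gradient arriving "from the right".
    dtanh = (1 - a_next ** 2) * da_next   # tanh'(z) = 1 - tanh(z)^2
    da_prev = Waa.T @ dtanh               # dJ/da^<t-1>: contains da_next as a factor
    dWax = dtanh @ xt.T
    dWaa = dtanh @ a_prev.T
    dba = np.sum(dtanh, axis=1, keepdims=True)
    return da_prev, dWax, dWaa, dba
```

Each timestep multiplies the incoming gradient by another Waa.T and tanh' factor, and that repeated multiplication is exactly what makes the product shrink over many timesteps.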

So that’s where you get the point I made in my earlier reply that the timesteps act analogously to the separate layers in an FCN or CNN in terms of how the gradients propagate.

The fundamental math behind this is the Chain Rule, which arises from the way the derivative of a composite function works. Suppose I have two functions f and g and I compose them:

G(z) = g(f(z))

meaning that the output of the first “inner” function f(z) becomes the input to the “outer” function g. Then the derivative of G(z) is:

G'(z) = g'(f(z)) * f'(z)

That’s where the product of the derivatives arises. If you express that with the “fraction” notation for derivatives as in \displaystyle \frac {df}{dz}, you get the classic expression of the Chain Rule.
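For example, writing u = f(z) for the intermediate value (my own shorthand here), the same rule in the fraction notation reads:

\displaystyle \frac {dG}{dz} = \frac {dG}{du} \cdot \frac {du}{dz} = g'(u) \cdot f'(z)

which is just the product of derivatives written out in that notation.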

Of course everything we are doing in forward propagation in multilayer networks or RNNs is a huge serial composition of functions. Rather than the game of telephone, the metaphorical picture that speaks to me is a huge onion, built up in layers during forward propagation. Then when you get to back propagation, you are peeling the onion one layer at a time as you compute and accumulate the derivatives.
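To make the onion picture concrete, here is a toy sketch (my own example, not course code): forward propagation repeatedly applies a^{<t>} = \tanh(w \cdot a^{<t-1>}) with a single scalar recurrent weight w standing in for W_{aa}, and back propagation peels the layers off again, multiplying one chain-rule factor per layer.

```python
import numpy as np

W = 0.9        # scalar stand-in for Waa
DEPTH = 1000   # number of timesteps ("layers" of the onion)

def forward(a0, depth, w=W):
    """Build up the onion: repeatedly apply a = tanh(w * a)."""
    activations = [a0]
    for _ in range(depth):
        activations.append(np.tanh(w * activations[-1]))
    return activations

def backward(activations, w=W):
    """Peel the onion: multiply one factor w * tanh'(w * a) per layer."""
    grad = 1.0
    for a in reversed(activations[:-1]):
        grad *= w * (1.0 - np.tanh(w * a) ** 2)
    return grad

acts = forward(0.5, DEPTH)
print(backward(acts))   # the gradient reaching the innermost layer has all but vanished
```

Every factor here has magnitude at most 0.9, so the product over 1,000 layers collapses toward zero.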
