If the weights are the same for each timestep, how does the vanishing gradient problem affect only the gradients at earlier timesteps (and therefore make long-term dependencies difficult)? The same weights at timestep 0 are used for timestep T. Shouldn’t learning for all the timesteps be screwed up?

The *weights* may be the same, but the *gradients* are not the same. You are forward propagating through (across) the timesteps, so on back propagation you are going backwards through the timesteps in the reverse order to compute the gradients. It’s a massive application of the Chain Rule: the further back you go, the more products you are multiplying together. Multiplying small numbers makes them smaller, right?

0.1 * 0.1 = 0.01

0.1 * 0.1 * 0.1 = 0.001

And so forth. You get the point …
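To make the arithmetic above concrete, here is a minimal sketch. It assumes, purely for illustration, that every step of the chain contributes the same factor of 0.1; in a real RNN the factors differ from step to step, and the function name `chained_product` is just a hypothetical helper.

```python
def chained_product(factor, num_steps):
    """Multiply `factor` together `num_steps` times, as a long chain-rule product would."""
    product = 1.0
    for _ in range(num_steps):
        product *= factor
    return product


# Assumed per-step factor, matching the 0.1 used in the arithmetic above.
factor = 0.1

# The more steps back we go, the more factors get multiplied together.
for steps in (2, 3, 10, 50):
    print(f"{steps:2d} steps back: {chained_product(factor, steps):.3e}")
```

The further back the timestep, the longer the product, which is why the gradient contribution from early timesteps is the one that shrinks the most.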

I understand how the chain rule works when we are back propagating through standard neural networks and how that leads to vanishing gradients. Maybe I’m not understanding backprop through time correctly. From my understanding, we compute the gradients of the entire network for each individual time step, so within each individual time step we are applying the chain rule. So then to get the total gradient for that network, are we summing the individual gradients for each timestep up or multiplying?

I think the confusion here is what is meant by “time step”. I think you are confusing “time step” with “iteration”. In one *iteration*, we go through all the time steps. The output of timestep 1 is the input to timestep 2, right? So the Chain Rule is being applied through the timesteps (backwards of course) in every iteration.

I do not understand how the chain rule is being applied through all the time steps. If there are 4 words in a sequence, we will end up with dL1, dL2, dL3, dL4. The same neural network is used at each timestep. To compute dL4/dw, we use the chain rule like a standard neural network. The equation would be something like: dL4/dy4 * dy4/da4 * da4/dw. So then what would the equation for dL3/dw look like?

The point is that the loss is the second to last step (before you take the average of the losses to get J), right? So it’s not dL3/dw, it’s dJ/dw3 (or dL/dw3 if you prefer to keep it in vector form). So the Chain Rule applies the same way it does in a normal “feed forward” network, except that you end up applying the gradients to the one set of shared weights when you actually apply them. Remember that we are composing functions *across timesteps* in the RNN case, instead of *across layers* in the DNN or CNN case. So if we extend your example assuming exactly 4 “time steps”, it ends up being:

dw3 = \displaystyle \frac {\partial J}{\partial L}\frac {\partial L}{\partial A^{(4)}}\frac {\partial A^{(4)}}{\partial Z^{(4)}}\frac {\partial Z^{(4)}}{\partial A^{(3)}}\frac {\partial A^{(3)}}{\partial Z^{(3)}}\frac {\partial Z^{(3)}}{\partial W^{(3)}}

Of course the confusing part here is (as you say) that the W values are really all the same. But the point is we get a different gradient for every time step and then apply them all at the “update parameters” step.

How are we “applying them all” in the update step? Are we summing dw1, dw2, dw3, dw4 together before applying the update rule (w := w - alpha * dw)?

It’s been a while since I watched these lectures, so I don’t recall what Prof Ng says about this. If you take a look at the optional back propagation section of the RNN Step by Step programming exercise (C5 W1 A1), you can see that it sums the gradients over the time steps, but it doesn’t actually show us the “update parameters” step, so that is left as an exercise for the imagination. I assume it would be as you show: just apply the sum of the gradients multiplied by the learning rate.
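For what it's worth, here is a rough sketch of that idea on a toy model. This is not the code from the assignment. It assumes a scalar RNN `a_t = tanh(w * a_{t-1} + x_t)` with the loss taken to be simply the last activation, and the names `forward`, `per_timestep_grads`, the inputs `xs` and the learning rate `alpha` are all made up for the illustration.

```python
import math


def forward(w, xs, a0):
    """Toy scalar RNN: a_t = tanh(w * a_{t-1} + x_t). Returns [a_0, a_1, ..., a_T]."""
    activations = [a0]
    for x in xs:
        activations.append(math.tanh(w * activations[-1] + x))
    return activations


def per_timestep_grads(w, xs, a0):
    """Return [dw1, ..., dwT]: each timestep's contribution to dL/dw, with L = a_T."""
    a = forward(w, xs, a0)
    grads = [0.0] * len(xs)
    upstream = 1.0  # dL/da_T, since the assumed loss is just a_T
    for t in range(len(xs), 0, -1):        # walk backwards through the timesteps
        dz = upstream * (1.0 - a[t] ** 2)  # dL/dz_t, using tanh'(z) = 1 - tanh(z)^2
        grads[t - 1] = dz * a[t - 1]       # dz_t/dw = a_{t-1} for this use of w
        upstream = dz * w                  # dL/da_{t-1}: one more factor in the chain
    return grads


# Assumed toy values: 4 timesteps, shared weight w, learning rate alpha.
w, a0, alpha = 0.5, 0.5, 0.1
xs = [0.5, -0.3, 0.8, 0.1]

grads = per_timestep_grads(w, xs, a0)
dw = sum(grads)         # sum the per-timestep gradients for the shared weight
w_new = w - alpha * dw  # w := w - alpha * dw

for t, g in enumerate(grads, start=1):
    print(f"dw{t} = {g:.5f}")
print(f"dw (sum) = {dw:.5f}, updated w = {w_new:.5f}")
```

Each pass through the backward loop multiplies `upstream` by one more factor, so the gradient that reaches timestep 1 has passed through more factors than the one at timestep 4, even though the weight `w` is the same at every step.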

But the reality here is that “we” don’t really do anything about this: as soon as Prof Ng has shown us how to build the basics ourselves directly in Python and NumPy, he switches to using TF/Keras for everything. That means that everything to do with computing the gradients and applying them is handled “under the covers”. This is his standard pedagogical paradigm: it is important that we have a good intuitive grasp of how the algorithms work, but in real life no-one builds these complete algorithms from scratch: you just use your framework of choice (TF, PyTorch, Caffe or a long list of others).