[Week 1] How are the weights updated in backpropagation through time?

In a standard NN you perform a gradient descent update on each layer, because each layer has its own set of weights.

In the RNN from this video you have the same set of weights for each layer. Doesn’t that mean that you perform gradient descent only once, and hence update the weights only once through backpropagation?

I think it’s worth watching the video again with the following thoughts in mind:

Rather than having different layers with different weights, we have a single RNN “cell” with one set of weights. But then we use that cell multiple times in a given forward pass. Prof Ng shows how the loss function is the sum of the losses at each application of the cell at each “time step”. So at each time step we have a function, which has a derivative. When we do the backpropagation phase, what is happening is that the same set of weights gets updated multiple times in one “iteration” of backprop: you first update for the loss at the final time step, then again for the loss at timestep T_y - 1, then T_y - 2 and so forth.
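Here is a minimal numpy sketch of that idea (my own illustrative code, not the Step by Step assignment; the parameter names Waa, Wax, ba follow the lecture notation, and the per-timestep loss gradients are just random stand-ins). Each timestep contributes its own piece to the gradient of the one shared weight set; in this sketch the pieces are accumulated and applied in a single update, which, as discussed further down, amounts to the same thing as applying them one at a time as long as you don’t recompute the earlier gradients:

```python
import numpy as np

# Minimal sketch of BPTT for a vanilla RNN (illustrative, not the course code).
# One shared set of weights (Waa, Wax, ba) is reused at every time step on the
# forward pass; on the backward pass each time step contributes a gradient to
# that same shared set.

np.random.seed(0)
n_a, n_x, T_x = 4, 3, 5           # hidden size, input size, number of time steps
Waa = np.random.randn(n_a, n_a) * 0.1
Wax = np.random.randn(n_a, n_x) * 0.1
ba  = np.zeros((n_a, 1))

x = np.random.randn(n_x, T_x)     # one training sequence
a = np.zeros((n_a, T_x + 1))      # hidden states, a[:, 0] is the initial state

# Forward pass: the SAME Waa, Wax, ba are applied at every time step.
for t in range(T_x):
    a[:, t + 1] = np.tanh(Waa @ a[:, t] + Wax @ x[:, t] + ba.ravel())

# Backward pass: suppose the per-timestep output losses give us da[:, t] = dLoss/da
# at each step (random stand-ins here).
da = np.random.randn(n_a, T_x)

dWaa = np.zeros_like(Waa)
dWax = np.zeros_like(Wax)
dba  = np.zeros_like(ba)
da_next = np.zeros((n_a,))        # gradient flowing back through the hidden state

for t in reversed(range(T_x)):
    dtanh = (1 - a[:, t + 1] ** 2) * (da[:, t] + da_next)
    # Every time step adds its contribution to the SAME shared weights:
    dWaa += np.outer(dtanh, a[:, t])
    dWax += np.outer(dtanh, x[:, t])
    dba  += dtanh.reshape(-1, 1)
    da_next = Waa.T @ dtanh

# One gradient-descent update of the single shared parameter set:
lr = 0.01
Waa -= lr * dWaa
Wax -= lr * dWax
ba  -= lr * dba
```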

Of course you are also iterating over all of the training samples to complete one “epoch”, or doing a two-level loop over the minibatches and then the samples within a given minibatch. Then you repeat the full epoch as many times as you need to get good convergence.
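In loop form, the outer structure is roughly this (illustrative sketch only; bptt_gradients here is a hypothetical stand-in for the full forward pass and backprop through time over one minibatch):

```python
import numpy as np

# Outer training loops only (illustrative; the gradient computation is faked).
def bptt_gradients(params, minibatch):
    # Hypothetical stand-in: in reality this runs forward prop over all timesteps
    # of every sample in the minibatch and then backprop through time.
    return {k: np.random.randn(*v.shape) for k, v in params.items()}

params = {"Waa": np.random.randn(4, 4), "Wax": np.random.randn(4, 3), "ba": np.zeros((4, 1))}
minibatches = [np.random.randn(8, 3, 5) for _ in range(10)]   # 10 toy minibatches
lr, num_epochs = 0.01, 3

for epoch in range(num_epochs):            # repeat the full epoch until convergence
    for minibatch in minibatches:          # one pass over all minibatches = one epoch
        grads = bptt_gradients(params, minibatch)
        for k in params:
            params[k] -= lr * grads[k]     # one parameter update per minibatch
```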


Thank you very much, Paul! You explained it extremely well!

I’m glad to hear that it helped. There is (at least) one more subtlety of the update process that we can’t completely tell from what Prof Ng says: since the loss is the sum of the timestep losses, you could define the gradient as the sum of the timestep gradients, meaning that you’d effectively be applying them all at once. I think the two methods are equivalent, unless you go to the full complexity of recomputing the earlier timestep gradients based on the updated weight values from the later timestep gradients, but that sounds too complicated and computationally expensive. My bet would be that they just go for the simpler implementation and it all comes out in the wash, since the update behavior is stochastic anyway.
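A tiny numerical check of that equivalence (toy code, with random matrices standing in for the per-timestep gradients): as long as the per-timestep gradients are all computed at the original weight values, applying them one at a time is identical to applying their sum.

```python
import numpy as np

# Toy check (not course code): with a fixed list of per-timestep gradients,
# applying them one at a time is the same as applying their sum, as long as
# the later gradients are NOT recomputed from the already-updated weights.

np.random.seed(1)
W0 = np.random.randn(3, 3)
grads = [np.random.randn(3, 3) for _ in range(4)]   # stand-ins for dW at each timestep
lr = 0.1

# Method 1: apply each timestep's gradient one at a time.
W_seq = W0.copy()
for g in grads:
    W_seq -= lr * g

# Method 2: sum the timestep gradients and apply them in one update.
W_sum = W0 - lr * sum(grads)

print(np.allclose(W_seq, W_sum))   # True
```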


Thank you @paulinpaloalto. I was also in doubt whether updating the weights T_y times during one backprop pass was correct, because the derivatives for the earlier time steps, computed with already-updated weights, would differ from \nabla \mathcal{L}.

But your last sentence confuses me. Do you mean the following?
When you simply add the gradients from each time step and update the weights once after backprop, you compute \nabla \mathcal{L} correctly. When you simply update the weights T_y times (once for each time step) without repeatedly recomputing the loss for the earlier time steps, you get slightly different results, but that is OK since SGD does not need perfect gradients for the update.

I may be missing your point, but I would say that the two methods are equivalent: you’re adding the sum of the gradients or adding them one at a time. That should give the same answer. But that must be too simplistic an interpretation of your point, since I’m sure you understand that.

What I was trying to say above was what you said here:

Doing the two-level iteration and recomputing the gradients at timestep T_y - 1 based on the application of the gradients for timestep T_y (and so forth through all the timesteps) would be way more expensive, and all of this is approximate anyway, so we can get sufficiently good results at much lower cost with the method that’s more efficient, even if it’s not quite as mathematically correct. The gradients are vectors in a plane that’s tangent to the surface at the current point, so they’re all pointing in slightly the wrong direction in any case, right? So we need to be careful not to take too large a step on any given iteration and hope that it all comes out in the wash of many thousands of iterations. Experimental evidence seems to suggest that this method works.

Of course the other thing to consider here is that, just as in Course 4, we’re leaning pretty heavily on TF/Keras here for the real implementation details. Prof Ng always shows us how to build things in Python ourselves first, so that we have better intuition for what’s going on. But then we switch to using the platform for real work. So both here and in Course 4 Week 1, we only get a very basic sketch of the actual implementation of backprop as an optional part of the Step by Step assignment and then we never really use that code. I have not had the courage or time to try looking at the source code for TF or PyTorch or … to understand what is really happening “under the covers” here.
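For concreteness, this is roughly what leaning on the platform looks like in practice (a made-up toy model and data, not the course assignment): Keras unrolls the recurrence and runs backprop through time for us inside fit().

```python
import numpy as np
import tensorflow as tf

# Sketch of how the framework hides BPTT (toy sizes and random data, purely
# illustrative). Keras builds the unrolled computation graph and handles the
# backward pass through all time steps when we call fit().
model = tf.keras.Sequential([
    tf.keras.layers.LSTM(32, input_shape=(10, 8)),   # T_x = 10 timesteps, 8 features
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")

x = np.random.randn(64, 10, 8).astype("float32")      # 64 toy sequences
y = np.random.randint(0, 2, size=(64, 1))
model.fit(x, y, epochs=2, batch_size=16, verbose=0)   # BPTT happens inside fit()
```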

Thank you Paul for your answer, covering many SGD aspects and thus drawing the big picture.

If you are adding them one at a time (without repeatedly recomputing the loss for the earlier time steps), you will not change the loss function, but you will change the coordinate at which you take the gradients, right? E.g. if you immediately update the weights based on the gradient of \mathcal{L}^{T_y}, will you not take the gradient of \mathcal{L}^{T_y-1} at a slightly different coordinate?
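To spell out the difference in symbols (my own notation, with \alpha as the learning rate and W_0 the weights before the update):

Sequential updates with recomputation: W_1 = W_0 - \alpha \nabla \mathcal{L}^{T_y}(W_0), then W_2 = W_1 - \alpha \nabla \mathcal{L}^{T_y - 1}(W_1), and so on down to t = 1.

Single summed update: W' = W_0 - \alpha \sum_{t=1}^{T_y} \nabla \mathcal{L}^{t}(W_0).

If the earlier gradients are not recomputed, i.e. \nabla \mathcal{L}^{T_y - 1} is still evaluated at W_0 rather than at W_1, the sequential version collapses to exactly the summed version; the “slightly different coordinate” only appears if you do recompute.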

Yes, I’m pretty sure we are saying the same thing here.

Yes, you’ve just spelled out in detail what I meant when I said that method is more mathematically correct. But it’s also way more expensive, since you’re introducing another layer of looping that involves recomputing the gradients. So my guess is that they don’t do that, but do the simplistic batch application of all the gradients across the timesteps, because of the added expense of the “fully correct” method. And the simpler, less accurate method apparently is “good enough”. Note that I haven’t actually investigated any more deeply and am just assuming that Prof Ng and Geoff Hinton and Yann LeCun and their various teams of grad students have explored this and Prof Ng is just showing us what is known to work.


Hi paulinpaloalto,

I had a similar question to paul2048’s and it has been resolved, but now I have a slightly different question.
You can correct me if I am wrong below.

In an LSTM, when we do forward prop, if I remember correctly, we use the same parameters (Wi, Wf, Wc, bi, bf, bc) at each time step (x⟨t⟩, for t = 1 to T_x). Isn’t that right?

Then in backprop, we have different dWi, dWf, dWc, dbi, dbf, dbc at each time step?
How do we do backprop at each epoch with this? Do we use the final derivatives (dWi, dWf, … at time step t = 1) and apply them at all time steps, or use different derivatives at each time step?

LSTM doesn’t really change the fundamental method of applying backprop: it works the same as with a basic RNN, it’s just that you have a lot more complex “state” in the cell. But you still do the backprop updates the same way: at each timestep.

Sorry. Do you mean that when we do forward prop, we start with one set of parameters (Wi, Wf, Wc, bi, bf, bc) and apply it at all time steps (t = 1 to T_x), but as we go through backprop, the parameters will be updated differently at each time step due to different derivatives (e.g. dWi at t = 1 will be different from dWi at t = 3)?
Then after training is finished, will we have T_x sets of parameters (Wi, Wf, Wc, bi, bf, bc)?

Yes, the parameters never change on forward prop, right? For any type of network, not just RNNs. You just apply them to the inputs and see what answers you get. Then you use backprop to push the parameters in directions that we hope will give a better answer. And in the case of an RNN (either plain, GRU or LSTM) there is one set of parameters and hidden state for the cell that gets used repeatedly on forward prop, but the inputs and outputs are different at each timestep and the hidden state also changes. Then the parameters that control the hidden state and outputs get updated differently by the gradients at each timestep when you do backprop because the output at each timestep is different. But all of those per-timestep gradients land on the same single shared set of parameters, so when training is finished you still have just one set of (Wi, Wf, Wc, bi, bf, bc), not T_x different sets.
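A schematic way to see that in code (toy example, not a real LSTM backward pass; the per-timestep gradients are random stand-ins): the gradient contributions from every timestep all land on the one shared parameter set.

```python
import numpy as np

# Toy schematic (not a real LSTM backward pass): one shared parameter set,
# with the per-timestep gradient contributions summed into a single update.
n_a, n_x, T_x = 4, 3, 5
params = {name: np.random.randn(n_a, n_a + n_x) for name in ("Wf", "Wi", "Wc")}
params.update({name: np.zeros((n_a, 1)) for name in ("bf", "bi", "bc")})

# Stand-ins for the (different) gradients computed at each timestep during backprop:
grads_per_step = [
    {name: np.random.randn(*p.shape) for name, p in params.items()}
    for t in range(T_x)
]

lr = 0.01
for name in params:
    total = sum(g[name] for g in grads_per_step)   # contributions from all timesteps
    params[name] -= lr * total                     # one update to the one shared set

print(list(params))   # still just ['Wf', 'Wi', 'Wc', 'bf', 'bi', 'bc'], not T_x copies
```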

Thanks paulinpaloalto for the answer.