I got the program correct, but I still don’t understand the concept behind one part. I tried to read the threads as much as I could, but I didn’t find the same question.
In exercise 8, lstm_backward, we keep adding to dWf as the loop moves from T_x down to 1:
dWf += gradients["dWf"]
dWi += gradients["dWi"]
dWc += gradients["dWc"]
But why do we add up dWf each time as we go backward (from T_x to 1)? Shouldn’t it just be like below?
dWf = gradients["dWf"]
dWi = gradients["dWi"]
dWc = gradients["dWc"]
I hope someone can explain why we do this.
The reason is the way that RNNs work: the same cell with the same coefficients is used at every timestep. Each timestep contributes to the result and therefore produces gradients during backpropagation, and the way you take that into account is by adding up the gradients across all the timesteps. Of course you’re also averaging them over the training samples as well, unless you’re doing Stochastic Gradient Descent.

There is some ambiguity here: because of the Chain Rule, the gradients at a given timestep include the gradients from all the later timesteps. Should we apply them only one step at a time and then recompute? That would add another loop to the whole process and be very inefficient, so we just add them all up in a given iteration. Gradient Descent is an approximation method and is statistical anyway, and it apparently works well enough this way.
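To make the "add them up" point concrete, here is a minimal sketch (my own toy example, not the course’s lstm_backward code) of a scalar recurrence h_t = w * h_{t-1} + x_t where a single weight w is shared across timesteps. Because w is reused at every step, the chain rule says dL/dw is the sum of the per-timestep contributions, which is exactly the `dW +=` pattern in the backward loop. A numerical gradient check confirms the accumulated value:

```python
# Toy shared-weight recurrence: h_t = w * h_{t-1} + x_t, with loss = h_T.
# The same w is used at every timestep, so dL/dw is the SUM of the
# per-timestep gradient contributions (the "dW +=" pattern).

def forward(w, x, h0=0.0):
    h = h0
    hs = [h0]            # cache hidden states for the backward pass
    for x_t in x:
        h = w * h + x_t
        hs.append(h)
    return h, hs         # loss is simply the final hidden state h_T

def backward(w, x, hs):
    dh = 1.0             # dL/dh_T = 1 since loss = h_T
    dW = 0.0
    # walk backward from t = T down to 1, accumulating dW at each step
    for t in reversed(range(len(x))):
        dW += dh * hs[t]  # this timestep's contribution: dh_t/dw = h_{t-1}
        dh = dh * w       # chain rule back to h_{t-1}
    return dW

w, x = 0.9, [1.0, 2.0, 3.0]
loss, hs = forward(w, x)
dW = backward(w, x, hs)

# Check the accumulated gradient against a numerical estimate
eps = 1e-6
num = (forward(w + eps, x)[0] - forward(w - eps, x)[0]) / (2 * eps)
print(dW, num)  # both should be 2*w*x1 + x2 = 3.8 for this input
```

If you replaced `dW +=` with `dW =`, you would keep only the t = 1 contribution and the gradient check would fail; the accumulation is what makes the backward pass consistent with weight sharing.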
Here’s another thread from a while back that discusses this same point in some detail. Please start with the linked post and read forward through the thread.
Thanks paulinpaloalto. I went through the other thread and have a better understanding now.