Hello, I'm trying to complete def rnn_backward(da, caches) in Exercise 6 - rnn_backward.
The results in Exercise 5 are correct with the test data, but in Exercise 6 they are not. The shapes are OK, but the values are not. I checked the posts here and saw I'm not the only one with this issue, but I can't figure out the problem.
I don't want to paste the code here.
Please fix your call to rnn_cell_backward to include the gradient at time t and the incoming gradient. In your implementation, only the incoming gradient is used.
Another hint: Look at da.
Hello everyone, first of all thanks to those of you who have posted here about this issue. It helped me fix some bugs in my rnn_cell_backward() function.
Now I have the same problem as Justin. In fact, my code produces the same gradients as his. I have already checked that the inputs to rnn_cell_backward(), da and cache, are sliced to time step t only, but the gradient values did not change. Also, rnn_cell_backward() passed all the previous tests, so I am quite lost as to how to solve this problem…
I would be very grateful for any ideas or comments!
It is stated in the notebook: "Choose wisely the 'da_next' and the 'cache' to use in the backward propagation step."
I guess the problem is with how you are choosing "da_next". It is not only da (the slice at t). You also have to add the gradient with respect to the hidden state that is passed back from the later time steps (da_prevt).
Please read the below text from the notebook:
Note that this notebook does not implement the backward path from the Loss ‘J’ backwards to ‘a’.
This would have included the dense layer and softmax which are a part of the forward path.
This is assumed to be calculated elsewhere and the result passed to rnn_backward in ‘da’.
You must combine this with the loss from the previous stages when calling rnn_cell_backward (see figure 7 above).
In other words, you have to add da_prevt to da (the slice at t). A rough sketch of how that fits into the loop is below.
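This is not a definitive solution, just a minimal sketch of where that addition goes, assuming the rnn_cell_backward() you wrote in Exercise 5 returns a gradients dictionary with keys like "dxt", "da_prev", "dWax", "dWaa", "dba" (check your own notebook for the exact names and cache layout):

```python
import numpy as np

def rnn_backward(da, caches):
    # Sketch only: shapes and dictionary keys are assumptions based on the notebook.
    (caches_list, x) = caches
    (a1, a0, x1, parameters) = caches_list[0]
    n_a, m, T_x = da.shape
    n_x, m = x1.shape

    # Initialize the accumulated gradients with zeros of the right shapes.
    dx = np.zeros((n_x, m, T_x))
    dWax = np.zeros((n_a, n_x))
    dWaa = np.zeros((n_a, n_a))
    dba = np.zeros((n_a, 1))
    da_prevt = np.zeros((n_a, m))   # nothing has flowed back yet at the last time step

    # Walk backwards in time.
    for t in reversed(range(T_x)):
        # The key point of this thread: combine the "external" gradient for step t
        # with the gradient flowing back from the later time steps.
        gradients = rnn_cell_backward(da[:, :, t] + da_prevt, caches_list[t])
        da_prevt = gradients["da_prev"]
        dx[:, :, t] = gradients["dxt"]
        dWax += gradients["dWax"]   # parameter gradients are summed over all time steps
        dWaa += gradients["dWaa"]
        dba += gradients["dba"]

    da0 = da_prevt  # what finally flows back to a<0>
    return {"dx": dx, "da0": da0, "dWax": dWax, "dWaa": dWaa, "dba": dba}
```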
That is what is happening in that “+” sign in the green oval that I added at the right hand side of the diagram. It is the da for the current timestep plus the cumulative sum of all the da values from the later timesteps, which is da_{prev} from the point of view of those later timesteps. You can see the current “step” feeding the next da_{prev} off the left side of the diagram to the previous timestep. Of course this is “back prop”, so we are going backwards and in an RNN it’s “backwards in time”, right? Because there is just one “layer” but we repeat it over and over and feed the results forward.
In addition to the explanation in the text that Saif highlighted there, it’s also visible in the diagram in my previous post here. Remember that there are two outputs from each timestep in an RNN: the feed forward of the hidden state to the next timestep and the output branch that generates the \hat{y}^{<t>} for that timestep. As the text in Saif’s post says, we don’t actually do the work to compute the da from the \hat{y} branch in this assignment: it’s just given to us as an input. Our work is to compute the other branch of the gradients and we just add the da they give us.
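To summarize the two posts above in one line, using the notebook's notation (just a restatement, not new material):

da_next^{<t>} = da^{<t>} + da_{prev}^{<t>}

where da^{<t>} is the slice of the given da (the gradient from the \hat{y}^{<t>} branch, computed elsewhere and handed to us), and da_{prev}^{<t>} is what the backward step at time t+1 passed back.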