I have a question
when implementing the rnn_backward(da, caches) function,
da is given,
while gradients = rnn_cell_backward(da[:,:,t], caches[t])
dxt, da_prevt, dWaxt, dWaat, dbat = gradients["dxt"], gradients["da_prev"], gradients["dWax"], gradients["dWaa"], gradients["dba"]
da_prevt is calculated by rnn_cell_backward function
why is da_prevt != da[:,:,t-1]?
I am so confused.
I’m not sure I understand your question, but my take is that the point is that backward propagation goes backwards: we start from the cost and then propagate the gradients in the opposite direction from forward propagation. In an RNN, things are a bit more complicated because at each timestep we project “forwards” in two directions: towards the \hat{y}^{<t>} of the current timestep (at least in the case of a “many to many” RNN) and also towards the updated hidden state a^{<t>} that will be input to the next timestep. So when we go backwards, we get gradients from both of those directions.
But maybe you’re saying that da[:,:,t-1] is basically the same as what they mean by da_prevt
in this formulation. I think that’s correct. In other words, what they are deriving is how to compute da[:,:,t-1], which is then used to compute the gradients for the various weight matrices as well.
All this is pretty complicated, so please let me know if I’m missing your real point here.
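To make that concrete, here is the chain-rule picture I have in mind (my own notation, not taken from the notebook): a^{<t>} feeds both \hat{y}^{<t>} and a^{<t+1>}, so the total gradient arriving at a^{<t>} is the sum of the contributions coming back along those two paths.

```latex
% Sketch of the chain rule at timestep t (my notation, not the notebook's).
% a^{<t>} feeds both \hat{y}^{<t>} and a^{<t+1>}, so its total gradient is
% the sum of the gradients coming back along those two paths.
\frac{\partial \mathcal{L}}{\partial a^{<t>}}
  = \underbrace{\frac{\partial \mathcal{L}}{\partial \hat{y}^{<t>}}
                \,\frac{\partial \hat{y}^{<t>}}{\partial a^{<t>}}}_{\text{output branch}}
  + \underbrace{\frac{\partial \mathcal{L}}{\partial a^{<t+1>}}
                \,\frac{\partial a^{<t+1>}}{\partial a^{<t>}}}_{\text{recurrent branch}}
```

Roughly speaking, the output-branch term is what each slice da[:,:,t] carries, and the recurrent-branch term is what comes back as da_prevt (more on that below).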
Hi Paul,
in the programming exercise,
my [def rnn_cell_backward(da_next, cache): return gradients] function passed the tests.
But the next function [def rnn_backward(da, caches): return gradients] does not pass the tests.
my code is like
```python
for t in reversed(range(T_x)):
    # Compute gradients at time step t. Choose wisely the "da_next" and the "cache" to use in the backward propagation step. (≈1 line)
    print('t is: ', t, ', da_prevt==da[:,:,t]: ', da_prevt == da[:,:,t])
    gradients = rnn_cell_backward(da[:,:,t], caches[t])
    # Retrieve derivatives from gradients (≈ 1 line)
    dxt, da_prevt, dWaxt, dWaat, dbat = gradients["dxt"], gradients["da_prev"], gradients["dWax"], gradients["dWaa"], gradients["dba"]
```
why, in each iteration, is da_prevt != da[:,:,t]?
it says Arguments:
da – Upstream gradients of all hidden states, of shape (n_a, m, T_x)
caches – tuple containing information from the forward pass (rnn_forward)
does it mean da comes from both the y stream and the hidden state a stream?
Or should I convert da[:,:,t] into a da that relates only to the hidden state a stream?
Sorry, I hadn’t looked at the backprop section in a while. It turns out they leave out the \hat{y} part of the backprop, according to this note in the instructions:
Note: rnn_cell_backward does not include the calculation of loss from y^{<t>}; this is incorporated into the incoming da_next. This is a slight mismatch with rnn_cell_forward, which includes a dense layer and softmax.
So they’re really sort of simplifying things a bit here, since we’re not really going to use this code. When we actually want to train a model, we’ll use TF and that handles all the backprop for us magically. This is just to give us some intuition about how things work.
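To see what “incorporated into the incoming da_next” means in practice: if the output layer at each timestep were a dense layer plus softmax with cross-entropy loss, the gradient that reaches a^{<t>} through that branch would be W_{ya}^T (\hat{y}^{<t>} - y^{<t>}), and something along those lines is what each slice da[:,:,t] already contains. A quick NumPy sketch, with variable names that are mine and not the notebook’s:

```python
import numpy as np

# Sketch only: how the "upstream" slice da[:,:,t] could be produced by the
# output branch, assuming y_hat = softmax(Wya @ a + by) with cross-entropy.
# All names here (Wya, y_hat, y, da_t) are illustrative, not from the notebook.
n_a, n_y, m = 5, 2, 10
Wya = np.random.randn(n_y, n_a)

y_hat = np.random.rand(n_y, m)
y_hat /= y_hat.sum(axis=0, keepdims=True)           # pretend softmax outputs
y = np.eye(n_y)[:, np.random.randint(0, n_y, m)]    # one-hot labels

dy = y_hat - y        # dL/dz for softmax + cross-entropy at timestep t
da_t = Wya.T @ dy     # shape (n_a, m): gradient reaching a^<t> from y^<t>
# da_t plays the role of one slice da[:,:,t] that rnn_backward receives,
# which is why rnn_cell_backward does not redo the dense/softmax part.
```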
What they mean by the “choose wisely” comment is that you need more than just da[:,:,t] to form da_next: it also includes the da_prevt value.

The point of da_prevt is that it was da_next from the next timestep, right? See the diagrams in the instructions, and you can see how it is computed in rnn_cell_backward. Notice that in rnn_backward, they start by initializing da_prevt to all zeros, because we’re starting backprop at the last timestep, so there is no “next” in that case. Then we get it as an output from rnn_cell_backward at each subsequent (well, previous really) timestep.
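Putting that together, here is a minimal sketch of the loop structure I have in mind. It assumes the notebook’s rnn_cell_backward(da_next, cache) signature and gradient keys, and that the dimensions n_a, n_x, m, T_x have already been pulled out of the shapes of da and the caches, so it is not a drop-in solution:

```python
import numpy as np

# Minimal sketch, assuming rnn_cell_backward, da, caches and the dimensions
# n_a, n_x, m, T_x are already defined as in the notebook.
da_prevt = np.zeros((n_a, m))      # no "next" timestep after the last one
dx = np.zeros((n_x, m, T_x))
dWax = np.zeros((n_a, n_x))
dWaa = np.zeros((n_a, n_a))
dba = np.zeros((n_a, 1))

for t in reversed(range(T_x)):
    # da_next at step t = upstream gradient from this step's y branch
    # (da[:,:,t]) plus the gradient flowing back from step t+1 (da_prevt)
    gradients = rnn_cell_backward(da[:,:,t] + da_prevt, caches[t])
    dxt, da_prevt = gradients["dxt"], gradients["da_prev"]
    dWaxt, dWaat, dbat = gradients["dWax"], gradients["dWaa"], gradients["dba"]
    dx[:,:,t] = dxt
    dWax += dWaxt                  # weight/bias gradients accumulate over timesteps
    dWaa += dWaat
    dba += dbat

da0 = da_prevt                     # gradient w.r.t. the initial hidden state a^<0>
```

The key line is the da[:,:,t] + da_prevt sum: that is the “choose wisely” part that was missing in the loop above.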