C5, W1A1 optional RNN backpropagation

I need help with the function rnn_backward(da, caches).
I have initialized the gradients as follows but got incorrect results:
# initialize the gradients with the right sizes (≈6 lines)
dx = np.zeros((n_x, m, T_x))
dWax = np.zeros((n_a, n_x))
dWaa = np.zeros((n_a, n_a))
dba = np.zeros((n_a, 1))
da0 = np.zeros((n_a, m, T_x))
da_prevt = np.zeros((n_a, m, T_x))

Then I loop through all the time steps:

for t in reversed(range(T_x)):
    # Compute gradients at time step t. Choose wisely the "da_next" and the "cache" to use in the backward propagation step. (≈1 line)
    gradients = rnn_cell_backward(da[:, :, t], caches[t])

# Set da0 to the gradient of a which has been backpropagated through all time-steps (≈1 line) 
da0 = da_prevt

I got:
gradients["dx"][1][2] = [-0.15028183 -0.34554547 0.02071758 0.01483317]
as compared to the expected values:
gradients["dx"][1][2] = [-2.07101689 -0.59255627 0.02466855 0.01483317]

Note: the shapes are all correct though.

Thanks!

Can anybody help, please?

You are missing the addition.


Thank you for your time, Jonaslalin. I assume it is allowed to post my code other than the graded portion. Here is my completed rnn_backward function. Note: the rnn_cell_backward function was checked and its outputs matched the expected ones.

# UNGRADED FUNCTION: rnn_backward

def rnn_backward(da, caches):
    """
    Implement the backward pass for an RNN over an entire sequence of input data.

    Arguments:
    da -- Upstream gradients of all hidden states, of shape (n_a, m, T_x)
    caches -- tuple containing information from the forward pass (rnn_forward)

    Returns:
    gradients -- python dictionary containing:
                        dx -- Gradient w.r.t. the input data, numpy-array of shape (n_x, m, T_x)
                        da0 -- Gradient w.r.t the initial hidden state, numpy-array of shape (n_a, m)
                        dWax -- Gradient w.r.t the input's weight matrix, numpy-array of shape (n_a, n_x)
                        dWaa -- Gradient w.r.t the hidden state's weight matrix, numpy-array of shape (n_a, n_a)
                        dba -- Gradient w.r.t the bias, of shape (n_a, 1)
    """
    ### START CODE HERE ###

    # Retrieve values from the first cache (t=1) of caches (≈2 lines)
    (caches, x) = caches
    (a1, a0, x1, parameters) = caches[0]

    # Retrieve dimensions from da's and x1's shapes (≈2 lines)
    n_a, m, T_x = da.shape
    n_x, m = x1.shape

    # initialize the gradients with the right sizes (≈6 lines)
    dx = np.zeros((n_x, m, T_x))
    dWax = np.zeros((n_a, n_x))
    dWaa = np.zeros((n_a, n_a))
    dba = np.zeros((n_a, 1))
    da0 = np.zeros((n_a, m))
    da_prevt = np.zeros((n_a, m))

    # Loop through all the time steps
    for t in reversed(range(T_x)):
        # Compute gradients at time step t. Choose wisely the "da_next" and the "cache" to use in the backward propagation step. (≈1 line)
        gradients = rnn_cell_backward(da[:, :, t], caches[t])
        # Retrieve derivatives from gradients (≈ 1 line)
        dxt, da_prevt, dWaxt, dWaat, dbat = gradients["dxt"], gradients["da_prev"], gradients["dWax"], gradients["dWaa"], gradients["dba"]
        # Increment global derivatives w.r.t parameters by adding their derivative at time-step t (≈4 lines)
        dx[:, :, t] = dxt
        dWax += dWaxt
        dWaa += dWaat
        dba += dbat

    # Set da0 to the gradient of a which has been backpropagated through all time-steps (≈1 line)
    da0 = da_prevt
    ### END CODE HERE ###

    # Store the gradients in a python dictionary
    gradients = {"dx": dx, "da0": da0, "dWax": dWax, "dWaa": dWaa, "dba": dba}

    return gradients

Outputs:
gradients["dx"][1][2] = [-0.15028183 -0.34554547 0.02071758 0.01483317]
gradients["dx"].shape = (3, 10, 4)
gradients["da0"][2][3] = -0.17268893183890754
gradients["da0"].shape = (5, 10)
gradients["dWax"][3][1] = 4.081485734449453
gradients["dWax"].shape = (5, 3)
gradients["dWaa"][1][2] = 1.056012342849445
gradients["dWaa"].shape = (5, 5)
gradients["dba"][4] = [-0.12427391]
gradients["dba"].shape = (5, 1)
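
For reference, this is roughly the per-time-step math that a tanh-based rnn_cell_backward computes; it is only a sketch, using the cache layout (a_next, a_prev, xt, parameters) seen above, and the parameter key names "Wax" and "Waa" are assumptions rather than a verbatim copy of the assignment code:

def rnn_cell_backward_sketch(da_next, cache):
    # Sketch only: assumes cache = (a_next, a_prev, xt, parameters)
    # and a_next = tanh(Wax @ xt + Waa @ a_prev + ba)
    (a_next, a_prev, xt, parameters) = cache
    Wax, Waa = parameters["Wax"], parameters["Waa"]  # assumed key names

    # Backprop through tanh: d/dz tanh(z) = 1 - tanh(z)^2
    dtanh = (1 - a_next ** 2) * da_next

    # Gradients w.r.t. input, previous hidden state, weights, and bias
    dxt = np.dot(Wax.T, dtanh)
    dWax = np.dot(dtanh, xt.T)
    da_prev = np.dot(Waa.T, dtanh)
    dWaa = np.dot(dtanh, a_prev.T)
    dba = np.sum(dtanh, axis=1, keepdims=True)

    return {"dxt": dxt, "da_prev": da_prev, "dWax": dWax, "dWaa": dWaa, "dba": dba}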

You are missing something here. Hint: you are not using da_prevt.


Thank you! I was confused and thought that da[:, :, t] was the "da_next" being passed into rnn_cell_backward in the loop for each time step t. Now I guess (I am using the word 'guess' and will ask a question at the end of this paragraph) that the output da_prevt should be passed to rnn_cell_backward in the loop. So right after initializing da_prevt = np.zeros((n_a, m)), I set it to da_prevt = da[:, :, -1] outside of the loop. In the loop, I call rnn_cell_backward(da_prevt, caches[t]) to compute the gradients. I still get the wrong outputs and am not certain where I made a mistake. (Now the initialization part has 7 lines.) One more relevant question: if the da being passed into rnn_backward is the "upstream gradients of all hidden states, of shape (n_a, m, T_x)", shouldn't da[:, :, t] be passed in instead of da_prevt in the loop for each rnn_cell_backward call?

# initialize the gradients with the right sizes (≈6 lines)
dx = np.zeros((n_x, m, T_x))
dWax = np.zeros((n_a, n_x))
dWaa = np.zeros((n_a, n_a))
dba = np.zeros((n_a, 1))
da0 = np.zeros((n_a, m))
da_prevt = np.zeros((n_a, m))
da_prevt = da[:, :, -1]
# Loop through all the time steps
for t in reversed(range(T_x)):
    # Compute gradients at time step t. Choose wisely the "da_next" and the "cache" to use in the backward propagation step. (≈1 line)
    gradients = rnn_cell_backward(da_prevt, caches[t])
    # Retrieve derivatives from gradients (≈ 1 line)
    dxt, da_prevt, dWaxt, dWaat, dbat = gradients["dxt"], gradients["da_prev"], gradients["dWax"], gradients["dWaa"], gradients["dba"]
    # Increment global derivatives w.r.t parameters by adding their derivative at time-step t (≈4 lines)
    dx[:, :, t] = dxt
    dWax += dWaxt
    dWaa += dWaat
    dba += dbat

# Set da0 to the gradient of a which has been backpropagated through all time-steps (≈1 line)
da0 = da_prevt
### END CODE HERE ###

Outputs:
gradients["dx"][1][2] = [0.04036334 0.01590669 0.00395097 0.01483317]
gradients["dx"].shape = (3, 10, 4)
gradients["da0"][2][3] = -0.0007053016291385033
gradients["da0"].shape = (5, 10)
gradients["dWax"][3][1] = 8.452426371294356
gradients["dWax"].shape = (5, 3)
gradients["dWaa"][1][2] = 1.2707651799408062
gradients["dWaa"].shape = (5, 5)
gradients["dba"][4] = [-0.50815277]
gradients["dba"].shape = (5, 1)

Thank you so much!

Now you are missing da[:, :, t].

Thanks jonaslalin! :grinning: I got it! So da_next needs to be updated before it is passed into the rnn_cell_backward call.


You can think of da_next as da[:, :, t] + da_prevt,
where da_prevt is initialized with zeros for the last RNN cell.
Keep in mind that da_prevt is updated in each reversed iteration over t.
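
In code, the loop then looks roughly like this (just a sketch of the idea, reusing the variable names from the snippets posted above; only the da_next argument changes):

for t in reversed(range(T_x)):
    # da_next at step t = upstream gradient from the loss at this step
    # plus the gradient flowing back from the cell at step t+1
    # (da_prevt starts as zeros for the last time step)
    da_next = da[:, :, t] + da_prevt
    gradients = rnn_cell_backward(da_next, caches[t])
    dxt, da_prevt, dWaxt, dWaat, dbat = (gradients["dxt"], gradients["da_prev"],
                                         gradients["dWax"], gradients["dWaa"], gradients["dba"])
    # Accumulate the parameter gradients over all time steps
    dx[:, :, t] = dxt
    dWax += dWaxt
    dWaa += dWaat
    dba += dbat

# After the loop, da_prevt holds the gradient w.r.t. the initial hidden state
da0 = da_prevt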

Maybe this will help someone in the future - a bit of scribbles to understand what's going on :slight_smile:


How did you create this? Very good for understanding and keeping things in perspective.