DLS course 5 week 1 assignment 1 LSTM backward pass equations

In the backward pass equations 7-21, two of them, da_prev and dxt, appear to be the exact same equation… The instructions also state that for da_prev the forget weight Wf uses the n_a index for the first n_a… so it says wf[:,: na]

This doesn't make much sense to me. I thought there would need to be a comma in there, so maybe it was a typo. The error mentioned something about too many indices, so I took out a colon. Using Wf[:, na] let the program move forward to the next error…

dxt. In this case it says to use n_a from first till the end, so I switched it to Wf[na, :]. Note that the instruction again has what seems to be a typo: Wf[:,na :]. Just adding the comma again gives "too many indices". Taking out one of the colons results in this error:

IndexError Traceback (most recent call last)
in
19 da_next_tmp = np.random.randn(5,10)
20 dc_next_tmp = np.random.randn(5,10)
---> 21 gradients_tmp = lstm_cell_backward(da_next_tmp, dc_next_tmp, cache_tmp)
22 print("gradients[\"dxt\"][1][2] =", gradients_tmp["dxt"][1][2])
23 print("gradients[\"dxt\"].shape =", gradients_tmp["dxt"].shape)

in lstm_cell_backward(da_next, dc_next, cache)
59 da_prev = np.dot(parameters['Wf'][:,n_a].T, dft) + np.dot(parameters['Wi'][:,n_a].T, dit) + np.dot(parameters['Wc'][:,n_a].T, dcct) + np.dot(parameters['Wo'][:,n_a].T, dot)
60 dc_prev = dc_next * ft + ot * (1 - np.square(np.tanh(c_next))) * ft * da_next
---> 61 dxt = np.dot(parameters['Wf'][n_a,:].T, dft) + np.dot(parameters['Wi'][n_a,:].T, dit) + np.dot(parameters['Wc'][n_a,:].T, dcct) + np.dot(parameters['Wo'][n_a,:].T, dot)
62 ### END CODE HERE ###
63

IndexError: index 5 is out of bounds for axis 0 with size 5

Hello @zac_builta!

Interesting question. First of all, there is no typo in the assignment.
Now let's go through your questions one by one.

Before starting, let's review some Python (NumPy) indexing. Here, [:, :n_a] means: select all the rows, and the columns from the start up to (but not including) index n_a.
For example, if we have a matrix M with shape (3, 5) and n_a=3, M[:, :n_a] will extract the submatrix containing all rows and the first three columns of M, which will be a matrix with shape (3, 3).

Similarly, [:, n_a:] means: select all the rows, and the columns from index n_a until the end.
For example, if we have a matrix M with shape (3, 5) and n_a=2, M[:, n_a:] will extract the submatrix containing all rows and the third through fifth columns of M (indices 2, 3 and 4), which will be a matrix with shape (3, 3).
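
The two slicing examples above can be checked directly in NumPy. This is only a small sketch; the matrix contents (np.arange) are made up so that the selected columns are easy to see.

```python
import numpy as np

# A 3x5 matrix whose entries are 0..14, so each column is easy to recognise.
M = np.arange(15).reshape(3, 5)

first = M[:, :3]  # all rows, columns 0, 1, 2 (n_a = 3)
rest = M[:, 2:]   # all rows, columns 2, 3, 4 (n_a = 2)

print(first.shape)  # (3, 3)
print(rest.shape)   # (3, 3)
print(first[0])     # first row of the left slice
print(rest[0])      # first row of the right slice
```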

So, does that make sense? All rows but only the first n_a columns…

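In contrast, Wf[:, n_a] (with the second colon removed) means that you select all rows but only one column, the one at index n_a. Likewise, Wf[n_a, :] selects the single row at index n_a, and since Wf has only 5 rows (indices 0 to 4) in this test, index 5 is out of bounds, which is the IndexError you saw. Got it?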
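
To see the difference between a slice and a single index, here is a minimal sketch. It assumes n_a = 5 and n_x = 3, which matches the shapes reported in this thread (dWf is (5, 8), dxt is (3, 10)); the random Wf is only a stand-in for the real parameter.

```python
import numpy as np

n_a, n_x = 5, 3
Wf = np.random.randn(n_a, n_a + n_x)  # stand-in weight matrix, shape (5, 8)

print(Wf[:, :n_a].shape)  # (5, 5): a slice keeps both dimensions
print(Wf[:, n_a:].shape)  # (5, 3)
print(Wf[:, n_a].shape)   # (5,): a single index drops the column dimension

# Indexing row n_a fails because valid row indices are 0..n_a-1.
try:
    Wf[n_a, :]
except IndexError as err:
    message = str(err)
    print(message)
```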

Yes, the equations have the same form, but each one uses a different set of columns. In da_prev, we choose the columns from the start up to n_a, while in dxt, we choose the columns from n_a to the end. Does that make sense to you?
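
As a rough illustration of why the two equations look alike, the sketch below builds da_prev and dxt from the two column blocks and checks their shapes. The weights and gate gradients are random placeholders (not the assignment's values), and the dictionary names W and dgate are hypothetical; only the slicing pattern is the point.

```python
import numpy as np

rng = np.random.default_rng(0)
n_a, n_x, m = 5, 3, 10  # sizes matching the shapes in this thread

gates = ("f", "i", "c", "o")
# Placeholder weights (n_a, n_a + n_x) and gate gradients (n_a, m).
W = {g: rng.standard_normal((n_a, n_a + n_x)) for g in gates}
dgate = {g: rng.standard_normal((n_a, m)) for g in gates}

# Same formula twice; only the column block of each W differs.
da_prev = sum(W[g][:, :n_a].T @ dgate[g] for g in gates)  # first n_a columns
dxt = sum(W[g][:, n_a:].T @ dgate[g] for g in gates)      # remaining n_x columns

print(da_prev.shape)  # (5, 10)
print(dxt.shape)      # (3, 10)

# Using the whole matrix gives both results stacked on top of each other.
full = sum(W[g].T @ dgate[g] for g in gates)  # shape (n_a + n_x, m)
print(np.allclose(full[:n_a], da_prev), np.allclose(full[n_a:], dxt))
```

The last check reflects that each W acts on the concatenation of a_prev and xt, so the first n_a columns belong to a_prev and the rest to xt.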

If you still have any doubts, feel free to ask.

Best,
Saif.


Thank you, this is great information. After fixing these mistakes, my code runs without errors. However, some of my results appear to be incorrect. All of my shapes are right, but some of my values aren't as expected. Some are close but still a little bit off, which makes me wonder whether there is some allowed variance in the answers, while others are completely off.

My values:
gradients["dxt"][1][2] = 3.188668839350378
gradients["dxt"].shape = (3, 10)
gradients["da_prev"][2][3] = -0.06501557963785477
gradients["da_prev"].shape = (5, 10)
gradients["dc_prev"][2][3] = 0.7975220387970015
gradients["dc_prev"].shape = (5, 10)
gradients["dWf"][3][1] = -0.1479548381644968
gradients["dWf"].shape = (5, 8)
gradients["dWi"][1][2] = 1.0574980552259903
gradients["dWi"].shape = (5, 8)
gradients["dWc"][3][1] = 2.251763865232517
gradients["dWc"].shape = (5, 8)
gradients["dWo"][1][2] = 0.3313115952892109
gradients["dWo"].shape = (5, 8)
gradients["dbf"][4] = [1.23112653]
gradients["dbf"].shape = (5, 1)
gradients["dbi"][4] = [-1.22362657]
gradients["dbi"].shape = (5, 1)
gradients["dbc"][4] = [1.22592469]
gradients["dbc"].shape = (5, 1)
gradients["dbo"][4] = [-0.18974623]
gradients["dbo"].shape = (5, 1)

Expected values:
gradients[dxt][1][2] = 3.23055911511
gradients[dxt].shape = (3, 10)
gradients[da_prev][2][3] = -0.0639621419711
gradients[da_prev].shape = (5, 10)
gradients[dc_prev][2][3] = 0.797522038797
gradients[dc_prev].shape = (5, 10)
gradients[dWf][3][1] = -0.147954838164
gradients[dWf].shape = (5, 8)
gradients[dWi][1][2] = 1.05749805523
gradients[dWi].shape = (5, 8)
gradients[dWc][3][1] = 2.30456216369
gradients[dWc].shape = (5, 8)
gradients[dWo][1][2] = 0.331311595289
gradients[dWo].shape = (5, 8)
gradients[dbf][4] = [ 0.18864637]
gradients[dbf].shape = (5, 1)
gradients[dbi][4] = [-0.40142491]
gradients[dbi].shape = (5, 1)
gradients[dbc][4] = [ 0.25587763]
gradients[dbc].shape = (5, 1)
gradients[dbo][4] = [ 0.13893342]
gradients[dbo].shape = (5, 1)

Note that your results are printed with about 16 significant digits and the expected ones show only 12, but the values should agree to those 12 digits (allowing for rounding in the last one). If they don't, then you have made some sort of mistake; there is no variance in the answers. Note that this is an excruciating exercise in translating all the given formulas into code. Start with the first value that's wrong and carefully compare your code to the formulas, remembering Prof. Ng's convention that * means elementwise multiplication and no operator means np.dot. Also note carefully the variable names they specify, which sometimes aren't so obvious from the math symbols.
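
One way to find the first value that's wrong is to compare the numbers programmatically instead of by eye. The sketch below uses the scalar values posted in this thread (the bias entries, which also differ, are left out for brevity); the dictionary names and the tolerance of 1e-9 are assumptions chosen for illustration, not part of the assignment.

```python
import numpy as np

# Scalar values copied from the posts above.
mine = {
    "dxt[1][2]": 3.188668839350378,
    "da_prev[2][3]": -0.06501557963785477,
    "dc_prev[2][3]": 0.7975220387970015,
    "dWf[3][1]": -0.1479548381644968,
    "dWi[1][2]": 1.0574980552259903,
    "dWc[3][1]": 2.251763865232517,
    "dWo[1][2]": 0.3313115952892109,
}
expected = {
    "dxt[1][2]": 3.23055911511,
    "da_prev[2][3]": -0.0639621419711,
    "dc_prev[2][3]": 0.797522038797,
    "dWf[3][1]": -0.147954838164,
    "dWi[1][2]": 1.05749805523,
    "dWc[3][1]": 2.30456216369,
    "dWo[1][2]": 0.331311595289,
}

# Relative tolerance loose enough for 12 printed digits, tight enough to
# flag values that are only "close".
mismatches = [
    name for name in expected
    if not np.isclose(mine[name], expected[name], rtol=1e-9, atol=0.0)
]
print(mismatches)  # ['dxt[1][2]', 'da_prev[2][3]', 'dWc[3][1]']
```

Here dc_prev, dWf, dWi and dWo agree with the expected values, so among these entries the formulas to recheck are the ones for dxt, da_prev and dWc.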
