It seems I’m not getting this right… apparently my choices for da_prev and the caches are wrong.
It runs, but the numbers are not correct.
I’m primarily confused about why I’m given an array da of upstream gradients for all timesteps (its shape is (n_a, m, T_x)) when my rnn_cell_backward also returns da_prevt to pass on to the previous timestep. What is the difference between the two “da”s I have here? I would have expected to get da[t+1] and then calculate all the other values of da. Or, since I’m missing the softmax part in my implementation, do I have to ignore my own result entirely?
The explanation for rnn_backward is very short compared to the explanation of rnn_cell_backward…
Hi, I have the same issue. I’m not sure what the difference is between the da_prev provided by da and the da_prev calculated by rnn_cell_backward, and I don’t really understand what “Choose wisely the ‘da_next’ and the ‘cache’…” means…
I finally understand the difference between the two da_prev’s. The da_prev provided by da is the gradient of the loss with respect to a^{<t>} through the dense layer and softmax path, whereas the da_prev returned by rnn_cell_backward is the gradient of the loss with respect to a^{<t>} through the time-step path. Therefore, when passing the total gradient of the loss with respect to a^{<t>} into the cell, you need to add the two da_prev’s (since the cost is affected by both paths simultaneously), as in the sketch below.
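Here is a minimal sketch of how that addition typically shows up in the backward loop. I’m assuming the interface from the assignment, where rnn_cell_backward(da_next, cache) returns a dictionary with keys "dxt", "da_prev", "dWax", "dWaa", and "dba", and where caches is a tuple of (list of per-step caches, x); treat this as an illustration, not the official solution.

```python
import numpy as np

def rnn_backward_sketch(da, caches):
    """Backward pass over all timesteps (sketch).

    da     -- shape (n_a, m, T_x): upstream gradients of the loss w.r.t. a<t>
              coming through the dense + softmax path at each timestep
    caches -- (list of per-step caches, x), as produced by rnn_forward
    Assumes rnn_cell_backward from the assignment is available.
    """
    caches_list, x = caches
    n_a, m, T_x = da.shape
    n_x = x.shape[0]

    # Accumulators for the gradients
    dx = np.zeros((n_x, m, T_x))
    dWax = np.zeros((n_a, n_x))
    dWaa = np.zeros((n_a, n_a))
    dba = np.zeros((n_a, 1))
    da_prevt = np.zeros((n_a, m))   # nothing flows back from "timestep T_x + 1"

    for t in reversed(range(T_x)):
        # The key step: the total gradient flowing into cell t is the sum of
        # the softmax/dense path (da[:, :, t]) and the path from timestep t+1.
        grads = rnn_cell_backward(da[:, :, t] + da_prevt, caches_list[t])
        da_prevt = grads["da_prev"]      # handed on to timestep t-1
        dx[:, :, t] = grads["dxt"]
        dWax += grads["dWax"]            # the weights are shared across time,
        dWaa += grads["dWaa"]            # so their gradients accumulate
        dba += grads["dba"]

    da0 = da_prevt                       # gradient w.r.t. the initial hidden state
    return {"dx": dx, "da0": da0, "dWax": dWax, "dWaa": dWaa, "dba": dba}
```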
Hope this helps anyone who’s also struggling on this problem.
Thanks for this. I was totally getting thrown here. I definitely think the da that’s fed into rnn_backward should have a different name, like da_softmax or something, to make it clear that it is additional to the derivatives with respect to a that are being calculated through the RNN cells and fed backwards as da_prevt.
This answer helped me get the right answer in rnn_backward, but I don’t understand why. I think my misunderstanding is related to the following questions:
I didn’t understand the comments about “Note that this notebook does not implement the backward path from the Loss ‘J’ backwards to ‘a’. This would have included the dense layer and softmax, which are a part of the forward path.” The softmax is only used on the ‘y’ calculation, so why are they relating it to ‘a’ which uses tanh?
I also didn’t understand why Figure 7 had the addition of da_next and a_t drawn in green. I don’t remember this addition being explained in a video, so if it was, please tell me which video to rewatch.
Which is the “dense layer” in Figure 2? The Waa, Wax, or Wya? My understanding is that each W matrix (i.e., Waa or Wax) represents a number of neurons equal to the number of columns of W, so they are each dense.
Could someone please explain these different things I didn’t understand, because I think they would all contribute to my understanding of why I had to add da_prevt to da_t.
So that looks exactly like a Fully Connected Feed Forward output layer with softmax as the activation. The “Dense Layer” there would be:

z^{<t>} = W_{ya} \cdot a^{<t>} + b_y

followed by \hat{y}^{<t>} = \mathrm{softmax}(z^{<t>}).
The point is that all the gradients are derivatives of J, but we are using the Chain Rule to decompose the calculation of those gradients. They are telling you that you only need to worry about the parts of the computation that come before the Dense + Softmax layers: the gradient through those last few layers is given to you as an input, and here we compute only the factors that come before it (see the sketch below).
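To make that concrete, here is a small numerical sketch of the Dense + Softmax path at a single timestep, assuming a cross-entropy loss with one-hot labels (the names Wya, by, y_hat and the sizes are just illustrative, and I’m ignoring the usual 1/m averaging over the batch). The last line is the kind of quantity that arrives in da[:, :, t] as an input to rnn_backward.

```python
import numpy as np

np.random.seed(1)
n_a, n_y, m = 5, 2, 10                                # illustrative sizes
Wya = np.random.randn(n_y, n_a)
by = np.random.randn(n_y, 1)
a_t = np.random.randn(n_a, m)                         # hidden state a<t> from forward prop
y = np.eye(n_y)[:, np.random.randint(0, n_y, m)]      # one-hot targets

# Forward through the "Dense + Softmax" output layer
z = Wya @ a_t + by
y_hat = np.exp(z) / np.sum(np.exp(z), axis=0, keepdims=True)

# Backward through that same layer: softmax + cross-entropy gives dz = y_hat - y
dz = y_hat - y
da_t_from_output = Wya.T @ dz                         # what shows up as da[:, :, t]
print(da_t_from_output.shape)                         # (n_a, m)
```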
As far as I understand, it’s OK to post the code for this optional exercise, so here is mine:
{moderator edit - solution code removed}
Does anybody know what I’m doing wrong? I would have assumed that the most subtle point would be the “da[:, :, t] + da_prevt” part, but apparently I’m missing something else, and only some of my values come out correct. Btw my rnn_cell_backward gives me the correct values. Here’s what I get for rnn_backward:
I just did a careful comparison between your code and mine and I’d say it all looks correct to me. So the first theory is that you’re not actually running the code you are looking at: if you type new code into a function cell and then just call the function again, it does nothing; it just reruns the old code. You have to actually press “Shift-Enter” on the modified cell in order to get the new code added to the runtime image. Or the simple “sledgehammer” way to get everything back in sync so that WYSIWYG is:
Kernel → Restart and Clear Output
Cell → Run All
If you still get mismatching answers after that, then I guess the second theory would be that your rnn_cell_backward code is not correct. If the “sync” doesn’t help, then it’s probably time to look at your whole notebook. I’ll send you a DM about how to do that.
Yes, I answered on the DM thread a couple of days ago. It’s an interesting problem. To close the loop on the public thread, the main problem turned out to be in rnn_cell_backward: there are ways to write that code that pass the test cases for that function (and in fact are arguably correct, although not the most efficient way to write the code) that then fail in the rnn_backward test case. That is because of the artificial way they generated the test case for rnn_backward. I’ll file a bug about the test case, but in the meantime a workaround was given on the DM thread.
We are working with tanh as one of the activation functions, so in backprop we need to compute the derivative of tanh. Here’s the relevant math:

\tanh'(z) = 1 - \tanh^2(z) = 1 - a^2, \quad \text{where } a = \tanh(z)
We are in a situation where we have both the values of z and a, from either the caches passed from forward prop or the inputs to the function. They generated the test case by using random values for all the inputs to rnn_backward, at least in the cases that matter here. So with the given input values it is not true that:
a = tanh(z)
They were assuming that you would write the derivative using (1 - a^2), since that is computationally simpler and more efficient. If you use the form (1 - tanh^2(z)), you are correct, but it doesn’t pass the test case.
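Here is a tiny numerical illustration of why the two mathematically equivalent forms disagree on artificially generated test data (the shapes and names are just for illustration):

```python
import numpy as np

np.random.seed(1)
z = np.random.randn(5, 10)       # pretend this came out of the cache
a = np.random.randn(5, 10)       # also generated randomly, so a != tanh(z)

dtanh_from_a = 1 - a ** 2                 # the form the test case expects
dtanh_from_z = 1 - np.tanh(z) ** 2        # equivalent math, different numbers here

print(np.allclose(dtanh_from_a, dtanh_from_z))   # False, because a != tanh(z)
# On real forward-prop values the two forms would agree, since then a = tanh(z).
```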
It’s great to hear that the information was useful. Thanks for confirming!
Maybe it’s worth stating the higher level point here in a more general way:
Of course when we write code, correctness is essential and is the first order of business, but correctness is not the only goal. A very important secondary goal is efficiency. The whole point of forward and backward propagation is that they are the backbone of the training process, which means we will be executing tens of thousands of iterations of these algorithms (and potentially many more) on typically quite large datasets to perform the training. For extreme examples, take a look at the descriptions of the model and dataset sizes for recent LLMs like GPT-3 and GPT-4. Whatever we can do to streamline the algorithms can make a big difference in the time, the compute cost, and thus the overall expense and practicality of the training.
As a mathematician, it’s one thing to write e^x or \sqrt{x} and then use what we know about the properties of those functions to prove theorems or find closed-form solutions of differential equations (where possible), but actually computing the value of e^x or \sqrt{x} for some specific input is not a simple matter. I haven’t looked at the details for e^x, but in the square root case you’re doing an iterative approximation based on Newton’s Method, at least in principle (e.g. finding the roots of x^2 - 2 = 0 to compute \sqrt{2}). So evaluating tanh(z) elementwise on a vector with potentially thousands of 64-bit floating point entries is not trivial in terms of compute cost. That’s why they typically use caches to pass the various intermediate values computed during forward propagation to backward propagation: if you’ve already computed something like tanh(z) once during forward prop, it’s a big win to just reuse that value where it’s required in the backprop algorithm. At the cost of some stack space and a few additional memory accesses, you save a large number of compute cycles.
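As a sketch of that caching pattern (the function and variable names here are mine, not the assignment’s exact interface):

```python
import numpy as np

def cell_forward(x_t, a_prev, Wax, Waa, ba):
    """Forward step: evaluate tanh once and stash what backprop will need."""
    z = Waa @ a_prev + Wax @ x_t + ba
    a = np.tanh(z)                   # the expensive elementwise evaluation
    cache = (a, a_prev, x_t)         # a little extra memory...
    return a, cache

def dtanh_from_cache(da_next, cache):
    """Backward step: reuse the cached a instead of calling tanh(z) again."""
    a, a_prev, x_t = cache
    return da_next * (1 - a ** 2)    # ...saves recomputing tanh entirely
```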
Of course you could say that the folks who wrote the test cases for these functions were being a little careless when they generated the inputs randomly, without taking into account the relationships that should exist between them, but they were assuming we’d write the algorithms in the simpler way and only tested their graders with the algorithm as they expected we’d write it, e.g. using tanh'(z) = (1 - a^2) in the above example and in the C5 W1 A1 case.
Speaking of complicated math and truncation errors, this year’s “calculation of pi by hand” video on Matt Parker’s YouTube channel (Stand Up Maths) is quite interesting.
He had 200 volunteers spend a week calculating the decimal digits of pi to great precision, by computing the sums of several Taylor series approximations of arctan.