Why do we need to add da_prevt to the upstream gradient, i.e. da_next = da[:,:,t] + da_prevt not just da[:,:,t]? I don’t understand the math here. This is for rnn_backward function. Why is da_prevt being added to the upstream gradient da[:,:,t]? What does upstream gradient actually mean in this case, since we are not computing the derivative from y at time step t ourselves?

this is my personal idea, thought i’m not sure. Consider this forward pass: 𝑦̂ ⟨𝑡⟩=𝑠𝑜𝑓𝑡𝑚𝑎𝑥(𝑊𝑦𝑎𝑎⟨𝑡⟩+𝑏𝑦) That is where the actual da[:,:,t] comes from. The da_prevt is from the previous block.
Anyway i got a broadcast problem with dba when I initialized it with shape (n_a, 1)
non-broadcastable output operand with shape (5,1) doesn’t match the broadcast shape (5,5)
I can fix this by initializing dba to be (n_a, ) but that seems unstable.

My question is that why are we adding da_prevt to da[:,:,t] while passing to rnn_cell_backward(da[:,:,t]+da_prevt,caches[t]) why not just pass da[:,:,t] like this rnn_cell_backward(da[:,:,t],caches[t])

Remember that you still have da computed when we use a_next to calculate y_hat .

And the idea for the " + " here is that a_next is used to calculate both y_hat and for the next block and when calculating derivate, we have to take into account both of them.

After some thinking this is my understanding: keep in mind that we are differentiating the sum of the losses across time, and the t-th loss is affected just by the first t iterations, not from the iterations from t+1 onwards.
My take is that, at step t, da_prect takes into account the derivatives of all the losses computed from t+1 onwards, whereas da [:,:,t] takes into account the derivative of the single loss at time t (the one we do not actually compute, but it is given).
Being the loss a sum of the single losses, also its differential will be the sum of the differentials of the single losses

I think Antonio is correct here as it was shown in the graph from the notebook where partial derivative of a_next is affected by two things–loss function at time step t and backprops from t+1 onwards.

The notebook writes in the 3rd paragraph in section ‘3 - Backpropagation in Recurrent neural Networks (OPTIONAL/UNGRADED)’:

Note that this notebook does not implement the backward path from the Loss ‘J’ backwards to ‘a’. This would have included the dense layer and softmax, which are a part of the forward path. This is assumed to be calculated elsewhere and the result passed to rnn_backward in da.

This da shown in the pink box in the figure is for the path starting from \hat{y}^{\langle t \rangle} (or the loss \mathcal{L}^{\langle t \rangle}(\hat{y}^{\langle t \rangle}, y^{\langle t \rangle}) there) to a^{\langle t \rangle}. So we are using the slice for time t here: da[:,:,t]. The total da should be the sum of this da and the part passing from the time step (t+1) (that is, the output of the rnn_cell_backward call at time t+1. So they are added here.

Why do we add these two? Let’s say that we want to compute \frac{\partial J}{\partial a} and that there are two variables p and q for which the effect of a is known. Then the chain rule says to add the two components: \frac{\partial J}{\partial a} = \frac{\partial J}{\partial p} \frac{\partial p}{\partial a} + \frac{\partial J}{\partial q} \frac{\partial q}{\partial a}. For our example, one of them is for the loss at t and the other is for the part from time step t+1.

I really like the question, since it points to the part of the exercise I found most interesting: not backpropagating through a computation tree, but through a (directed) computation graph. My mental model is this, with the derivative of a sum being the sum of the derivatives (*):

Since \mathcal{L} is the sum of all \mathcal{L}^{<t>} and (*), we can compute the derivatives for \mathcal{L}^{<t>} and sum up the results.

We could compute the derivatives for each \mathcal{L}^{<t>} individually, which would correspond to not using rnn_cell_backward(da[:,:,t]+da_prevt,caches[t]), but rnn_cell_backward(da[:,:,t],caches[t]) + rnn_cell_backward(da_prevt,caches[t]). But that would mean we unwind the computation graph into a computation tree and compute the joint steps backward through time repeatitively (leading to O(c^2) instead of c backprop steps, with c being the number of elements in the computation graph).

We again use (*) (but the other way around): we can add up the elements that have joint steps backward through time because the derivatives of each \mathcal{L}^{<t>} would need to be summed up at the end anyway. So using rnn_cell_backward(da[:,:,t]+da_prevt,caches[t]) leads to only c backprop steps and what comes out at the end is already the sum (e.g. dWaa[0] is d\mathcal{L}/dWaa}).

In summary, going backward through time during backprop, we use

(*) from left to right: branch out from d\mathcal{L} to T_y derivatives d\mathcal{L}^{<t>}

(*) from right to left: join branches via rnn_cell_backward(da[:,:,t]+da_prevt,caches[t]) instead of using rnn_cell_backward(da[:,:,t],caches[t]) + rnn_cell_backward(da_prevt,caches[t])

you can refer to this 9.7. Backpropagation Through Time — Dive into Deep Learning 1.0.3 documentation
key point is that Loss function is the sum of all time steps, y_t denpends on a_t, and y_{t+1} depends on a_{t+1} while a_{t+1} also depends on a_t. Thus L depends on a_t through y_t and a_{t+1} . L ~f(y_t,a_{t+1}), y_t ~f(a_t), a_{t+1} ~f(a_t}. And you apply the chain rule of multivariate function.

Thanks David for the helpful intuition. Another way to look at the redundant method:

When I was deriving the backprop from scratch, my intuition was to traverse from every outer layer to the first cell (i.e. dL[T] / da[1], dL[T-1] / da[1], …), and then sum them up.

Essentially, instead of adding da’s while you traverse (and only traversing once), you traverse Tx times (starting from each y) and then add gradients.

In quasi-code:

for t2 in (Tx, …, 1):

da_prevt = da[:,:,t2] # softmax outer layer
for t in (t2, …, 1):

run_cell_backwards(da_prevt, caches[t])
increment all gradients

I confirmed that the gradients at the end match, and that the runtime scales by Tx^2 (due to nested for loops) instead of Tx.

Glad I could reconcile my slower (but more intuitive) way with the more efficient method. Hope this is helpful for others.