Why do we need to add da_prevt to the upstream gradient, i.e. da_next = da[:,:,t] + da_prevt and not just da[:,:,t]? I don't understand the math here. This is for the rnn_backward function. Why is da_prevt being added to the upstream gradient da[:,:,t]? What does "upstream gradient" actually mean in this case, since we are not computing the derivative from y at time step t ourselves?

This is my personal idea, though I'm not sure. Consider this forward pass: \hat{y}^{\langle t \rangle} = softmax(W_{ya} a^{\langle t \rangle} + b_y). That is where the actual da[:,:,t] comes from. The da_prevt is from the previous block.
Anyway, I got a broadcast problem with dba when I initialized it with shape (n_a, 1):
non-broadcastable output operand with shape (5,1) doesn't match the broadcast shape (5,5)
I can fix this by initializing dba as (n_a,), but that seems unstable.
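The broadcast error above comes from summing over the batch axis without keepdims: np.sum(da_next, axis=1) has shape (n_a,), which broadcasts against a (n_a, 1) accumulator into (n_a, n_a). A minimal reproduction and the usual fix (keepdims=True), with hypothetical toy sizes:

```python
import numpy as np

n_a, m = 5, 10
rng = np.random.default_rng(0)
da_next = rng.standard_normal((n_a, m))  # stand-in for an upstream gradient slice

# Reproducing the error: (n_a,) broadcast against (n_a, 1) gives (n_a, n_a),
# which cannot be written back into the (n_a, 1) output operand.
dba = np.zeros((n_a, 1))
try:
    dba += np.sum(da_next, axis=1)
except ValueError as e:
    print(e)  # non-broadcastable output operand with shape (5,1) ...

# Fix: keep the summed axis so shapes stay (n_a, 1) on both sides.
dba = np.zeros((n_a, 1))
dba += np.sum(da_next, axis=1, keepdims=True)
print(dba.shape)  # (5, 1)
```

This keeps dba as a column vector, matching the shape of ba, rather than relying on a flat (n_a,) array.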

My question is: why are we adding da_prevt to da[:,:,t] when passing to rnn_cell_backward(da[:,:,t]+da_prevt, caches[t])? Why not just pass da[:,:,t], as in rnn_cell_backward(da[:,:,t], caches[t])?

Remember that you still have the da that was computed where a_next is used to calculate y_hat.

And the idea behind the "+" here is that a_next is used to calculate both y_hat and the next block, so when calculating the derivative we have to take both into account.

After some thinking this is my understanding: keep in mind that we are differentiating the sum of the losses across time, and the t-th loss is affected just by the first t iterations, not from the iterations from t+1 onwards.
My take is that, at step t, da_prevt accounts for the derivatives of all the losses computed from t+1 onwards, whereas da[:,:,t] accounts for the derivative of the single loss at time t (the one we do not actually compute, but that is given).
Since the loss is a sum of the individual losses, its differential is likewise the sum of the differentials of the individual losses.

I think Antonio is correct here, as shown in the graph from the notebook: the partial derivative with respect to a_next is affected by two things, the loss function at time step t and the backprop from t+1 onwards.

The notebook writes in the 3rd paragraph of section "3 - Backpropagation in Recurrent neural Networks (OPTIONAL/UNGRADED)":

Note that this notebook does not implement the backward path from the Loss "J" backwards to "a". This would have included the dense layer and softmax, which are a part of the forward path. This is assumed to be calculated elsewhere and the result passed to rnn_backward in da.

This da shown in the pink box in the figure is for the path starting from \hat{y}^{\langle t \rangle} (or from the loss \mathcal{L}^{\langle t \rangle}(\hat{y}^{\langle t \rangle}, y^{\langle t \rangle}) there) to a^{\langle t \rangle}. So we use the slice for time t here: da[:,:,t]. The total gradient into a^{\langle t \rangle} is the sum of this da and the part coming from time step (t+1), that is, the output of the rnn_cell_backward call at time t+1. So they are added here.

Why do we add these two? Let's say that we want to compute \frac{\partial J}{\partial a}, and that there are two intermediate variables p and q through which a affects J. Then the chain rule says to add the two components: \frac{\partial J}{\partial a} = \frac{\partial J}{\partial p} \frac{\partial p}{\partial a} + \frac{\partial J}{\partial q} \frac{\partial q}{\partial a}. For our example, one of them is the loss at t and the other is the part from time step t+1.
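As a sanity check of this two-path chain rule, here is a toy scalar sketch: tanh and 3a+1 are arbitrary stand-ins for the y_hat path and the next-time-step path (they are not the notebook's functions), and the analytic sum of the two components is compared against a numerical derivative.

```python
import numpy as np

# a feeds two branches, p and q, which both feed J.
# Analogue: a^<t> feeds y_hat^<t> (loss at t) and a^<t+1> (later losses).
def forward(a):
    p = np.tanh(a)        # branch 1, like the y_hat path
    q = 3.0 * a + 1.0     # branch 2, like the next time step
    return p**2 + q**2    # J

a = 0.7
p, q = np.tanh(a), 3.0 * a + 1.0
# Two-path chain rule: dJ/da = dJ/dp * dp/da + dJ/dq * dq/da
dJ_da = 2 * p * (1 - p**2) + 2 * q * 3.0

# Central-difference check
eps = 1e-6
num = (forward(a + eps) - forward(a - eps)) / (2 * eps)
print(dJ_da, num)  # the two agree: contributions from both branches are summed
```

Dropping either term (the analogue of passing only da[:,:,t] or only da_prevt) makes the analytic value disagree with the numerical derivative.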

I really like the question, since it points to the part of the exercise I found most interesting: not backpropagating through a computation tree, but through a (directed) computation graph. My mental model is this, with the derivative of a sum being the sum of the derivatives (*):

Since \mathcal{L} is the sum of all \mathcal{L}^{<t>} and (*), we can compute the derivatives for \mathcal{L}^{<t>} and sum up the results.

We could compute the derivatives for each \mathcal{L}^{<t>} individually, which would correspond to not using rnn_cell_backward(da[:,:,t]+da_prevt,caches[t]), but rnn_cell_backward(da[:,:,t],caches[t]) + rnn_cell_backward(da_prevt,caches[t]). But that would mean we unwind the computation graph into a computation tree and compute the joint steps backward through time repeatedly (leading to O(c^2) instead of O(c) backprop steps, with c being the number of elements in the computation graph).

We again use (*) (but the other way around): we can add up the elements that share joint steps backward through time, because the derivatives of each \mathcal{L}^{<t>} would need to be summed up at the end anyway. So using rnn_cell_backward(da[:,:,t]+da_prevt,caches[t]) leads to only c backprop steps, and what comes out at the end is already the sum (e.g. the returned dWaa is already d\mathcal{L}/dW_{aa}).
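The step of folding the two calls into one relies on rnn_cell_backward being linear in its upstream gradient. A small sketch of just the da_prev output of a hypothetical tanh RNN cell backward (the function name and cached values here are illustrative, not the notebook's):

```python
import numpy as np

rng = np.random.default_rng(1)
n_a, m = 3, 4
a_next = np.tanh(rng.standard_normal((n_a, m)))  # hypothetical cached activation
Waa = rng.standard_normal((n_a, n_a))

def cell_backward_da_prev(da_next):
    # da_prev part of a tanh RNN cell's backward: Waa^T @ ((1 - a^2) * da)
    return Waa.T @ ((1 - a_next**2) * da_next)

g1 = rng.standard_normal((n_a, m))  # plays the role of da[:,:,t]
g2 = rng.standard_normal((n_a, m))  # plays the role of da_prevt

lhs = cell_backward_da_prev(g1 + g2)                          # one joint call
rhs = cell_backward_da_prev(g1) + cell_backward_da_prev(g2)   # two separate calls
print(np.allclose(lhs, rhs))  # True: backward is linear in the upstream gradient
```

Since the backward pass is built from elementwise products and matrix multiplications of the upstream gradient, summing first and calling once gives the same result at lower cost.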

In summary, going backward through time during backprop, we use

(*) from left to right: branch out from d\mathcal{L} to T_y derivatives d\mathcal{L}^{<t>}

(*) from right to left: join branches via rnn_cell_backward(da[:,:,t]+da_prevt,caches[t]) instead of using rnn_cell_backward(da[:,:,t],caches[t]) + rnn_cell_backward(da_prevt,caches[t])

You can refer to 9.7. Backpropagation Through Time in the Dive into Deep Learning 1.0.3 documentation.
The key point is that the loss function is the sum over all time steps; y_t depends on a_t, and y_{t+1} depends on a_{t+1}, while a_{t+1} also depends on a_t. Thus L depends on a_t both through y_t and through a_{t+1}: L = f(y_t, a_{t+1}), y_t = f(a_t), a_{t+1} = f(a_t). Then you apply the chain rule for multivariate functions.
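Putting the thread together, here is a minimal self-contained BPTT sketch, with toy sizes and the loss taken as J = sum of all a^{<t>} entries so that every upstream slice da[:,:,t] is simply ones. None of this is the notebook's exact code; it just demonstrates the da[:,:,t] + da_prevt accumulation and verifies one entry of dWaa with a numerical gradient check.

```python
import numpy as np

rng = np.random.default_rng(0)
n_a, n_x, m, T = 3, 2, 4, 5  # hypothetical toy sizes

Waa = rng.standard_normal((n_a, n_a)) * 0.1
Wax = rng.standard_normal((n_a, n_x)) * 0.1
ba = np.zeros((n_a, 1))
x = rng.standard_normal((n_x, m, T))
a0 = np.zeros((n_a, m))

def rnn_forward(x, a0):
    a = np.zeros((n_a, m, T))
    a_prev = a0
    for t in range(T):
        a_prev = np.tanh(Waa @ a_prev + Wax @ x[:, :, t] + ba)
        a[:, :, t] = a_prev
    return a

a = rnn_forward(x, a0)
# Pretend the dense/softmax layers handed us dJ/da^<t>; with J = a.sum(), da is all ones.
da = np.ones((n_a, m, T))

# Backward through time: total gradient into a^<t> is da[:,:,t] (loss at t)
# PLUS da_prevt (everything from t+1 onwards) -- the "+" this thread is about.
da_prevt = np.zeros((n_a, m))
dWaa = np.zeros_like(Waa)
for t in reversed(range(T)):
    a_prev = a0 if t == 0 else a[:, :, t - 1]
    dtanh = (1 - a[:, :, t] ** 2) * (da[:, :, t] + da_prevt)
    dWaa += dtanh @ a_prev.T
    da_prevt = Waa.T @ dtanh

# Numerical check on one entry of Waa
eps = 1e-6
Waa[0, 0] += eps;     Jp = rnn_forward(x, a0).sum()
Waa[0, 0] -= 2 * eps; Jm = rnn_forward(x, a0).sum()
Waa[0, 0] += eps
num_grad = (Jp - Jm) / (2 * eps)
print(dWaa[0, 0], num_grad)  # should match closely
```

Replacing da[:,:,t] + da_prevt with just da[:,:,t] in the loop makes the analytic gradient diverge from the numerical one, which is an easy way to convince yourself the sum is required.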