Week 1 Assignment 1 Backpropagation

Why do we need to add da_prevt to the upstream gradient, i.e. da_next = da[:,:,t] + da_prevt rather than just da[:,:,t]? I don’t understand the math here. This is in the rnn_backward function. Why is da_prevt added to the upstream gradient da[:,:,t]? And what does “upstream gradient” actually mean in this case, given that we are not computing the derivative from y at time step t ourselves?

gradients = rnn_cell_backward(da[:,:,t]+da_prevt,caches[t])

8 Likes

Do you still have a doubt about this issue?

Yes, though I completed the assignment, I still have this question on the theory.

1 Like

I also struggle to understand this. Could you please explain why it should be done like this?

1 Like

This is my personal idea, though I’m not sure. Consider this forward pass: \hat{y}^{\langle t \rangle} = \mathrm{softmax}(W_{ya} a^{\langle t \rangle} + b_y). That is where the actual da[:,:,t] comes from. The da_prevt is from the previous block.
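To make that concrete (my own addition, assuming the standard cross-entropy loss that the notebook computes elsewhere): the slice passed in as da[:,:,t] would then be

\frac{\partial \mathcal{L}^{\langle t \rangle}}{\partial a^{\langle t \rangle}} = W_{ya}^{\top}\left(\hat{y}^{\langle t \rangle} - y^{\langle t \rangle}\right),

i.e. the gradient of the loss at time t pushed back through the dense layer and the softmax.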
Anyway, I got a broadcasting error with dba when I initialized it with shape (n_a, 1):
non-broadcastable output operand with shape (5,1) doesn’t match the broadcast shape (5,5)
I can work around it by initializing dba with shape (n_a,), but that seems fragile.

1 Like

Hi, I have a doubt about this issue. Please help.

1 Like

Can you give a little more information about your doubt?

My question is: why are we adding da_prevt to da[:,:,t] when passing it in, i.e. rnn_cell_backward(da[:,:,t]+da_prevt,caches[t])? Why not just pass da[:,:,t], like this: rnn_cell_backward(da[:,:,t],caches[t])?

1 Like

Remember that you still have a da computed from the step where a_next is used to calculate y_hat (see the attached figure).

And the idea behind the " + " here is that a_next is used both to calculate y_hat and as the input to the next block, so when calculating the derivative we have to take both of them into account.

Does this explanation clarify the matter?

9 Likes

Try setting keepdims=True in np.sum() in rnn_cell_backward()
Had the same issue before.
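For anyone hitting that broadcast error, a minimal sketch (with made-up shapes n_a = 5, m = 10) of why keepdims matters for dba inside rnn_cell_backward:

import numpy as np

n_a, m = 5, 10
dtanh = np.random.randn(n_a, m)                    # stand-in for the (1 - a_next**2) * da_next term
dba_bad  = np.sum(dtanh, axis=1)                   # shape (5,): later broadcasts against (5, 1) into (5, 5)
dba_good = np.sum(dtanh, axis=1, keepdims=True)    # shape (5, 1): matches the shape of ba, so accumulation works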

1 Like

After some thinking, this is my understanding: keep in mind that we are differentiating the sum of the losses across time, and the t-th loss is affected only by the first t iterations, not by the iterations from t+1 onwards.
My take is that, at step t, da_prevt accounts for the derivatives of all the losses computed from t+1 onwards, whereas da[:,:,t] accounts for the derivative of the single loss at time t (the one we do not actually compute here, but which is given to us).
Since the total loss is a sum of the individual losses, its differential is also the sum of the differentials of the individual losses.
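In symbols (borrowing the notebook’s notation), that is just linearity of the derivative:

\frac{\partial \mathcal{L}}{\partial a^{\langle t \rangle}} = \sum_{k \ge t} \frac{\partial \mathcal{L}^{\langle k \rangle}}{\partial a^{\langle t \rangle}} = \frac{\partial \mathcal{L}^{\langle t \rangle}}{\partial a^{\langle t \rangle}} + \sum_{k > t} \frac{\partial \mathcal{L}^{\langle k \rangle}}{\partial a^{\langle t \rangle}}

where the first term is the slice da[:,:,t] and the second term is what arrives in da_prevt from the backward call at step t+1 (losses before time t do not depend on a^{\langle t \rangle} at all).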

5 Likes

I think Antonio is correct here, as shown in the graph from the notebook: the partial derivative with respect to a_next is affected by two things, the loss function at time step t and the backprop from time steps t+1 onwards.

3 Likes

The notebook writes in the 3rd paragraph in section ‘3 - Backpropagation in Recurrent neural Networks (OPTIONAL/UNGRADED)’:

Note that this notebook does not implement the backward path from the Loss ‘J’ backwards to ‘a’. This would have included the dense layer and softmax, which are a part of the forward path. This is assumed to be calculated elsewhere and the result passed to rnn_backward in da.

The da shown in the pink box in the figure is for the path starting from \hat{y}^{\langle t \rangle} (or the loss \mathcal{L}^{\langle t \rangle}(\hat{y}^{\langle t \rangle}, y^{\langle t \rangle}) there) to a^{\langle t \rangle}. So we are using the slice for time t here: da[:,:,t]. The total da should be the sum of this da and the part passed back from time step t+1 (that is, the output of the rnn_cell_backward call at time t+1). So they are added here.

Why do we add these two? Let’s say that we want to compute \frac{\partial J}{\partial a} and that there are two variables p and q for which the effect of a is known. Then the chain rule says to add the two components: \frac{\partial J}{\partial a} = \frac{\partial J}{\partial p} \frac{\partial p}{\partial a} + \frac{\partial J}{\partial q} \frac{\partial q}{\partial a}. For our example, one of them is for the loss at t and the other is for the part from time step t+1.
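A toy numerical check of that two-path chain rule (my own made-up p, q and J, nothing from the notebook):

# J depends on a only through p and q, so dJ/da is the sum of the two path contributions
a = 1.5
p, q = 3.0 * a, a ** 2
J = p ** 2 + 2.0 * q                              # J(a) = 9a^2 + 2a^2 = 11a^2

dJ_dp, dp_da = 2.0 * p, 3.0
dJ_dq, dq_da = 2.0, 2.0 * a
chain_rule = dJ_dp * dp_da + dJ_dq * dq_da        # 22 * a = 33.0

eps = 1e-6
finite_diff = ((3.0 * (a + eps)) ** 2 + 2.0 * (a + eps) ** 2 - J) / eps
print(chain_rule, finite_diff)                    # both print roughly 33.0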

4 Likes

I really like the question, since it points to the part of the exercise I found most interesting: not backpropagating through a computation tree, but through a (directed) computation graph. My mental model is this, with the derivative of a sum being the sum of the derivatives (*):

  1. Since \mathcal{L} is the sum of all \mathcal{L}^{<t>} and (*), we can compute the derivatives for \mathcal{L}^{<t>} and sum up the results.
  2. We could compute the derivatives for each \mathcal{L}^{<t>} individually, which would correspond to using not rnn_cell_backward(da[:,:,t]+da_prevt,caches[t]), but rnn_cell_backward(da[:,:,t],caches[t]) + rnn_cell_backward(da_prevt,caches[t]). But that would mean we unwind the computation graph into a computation tree and compute the joint steps backward through time repeatedly (leading to O(c^2) instead of O(c) backprop steps, with c being the number of elements in the computation graph).
  3. We again use (*) (but the other way around): we can add up the elements that share steps backward through time, because the derivatives of each \mathcal{L}^{<t>} would need to be summed up at the end anyway. So using rnn_cell_backward(da[:,:,t]+da_prevt,caches[t]) leads to only O(c) backprop steps, and what comes out at the end is already the sum (e.g. the final dWaa is \partial\mathcal{L}/\partial W_{aa}).

In summary, going backward through time during backprop, we use

  • (*) from left to right: branch out from d\mathcal{L} to T_y derivatives d\mathcal{L}^{<t>}
  • (*) from right to left: join branches via rnn_cell_backward(da[:,:,t]+da_prevt,caches[t]) instead of using rnn_cell_backward(da[:,:,t],caches[t]) + rnn_cell_backward(da_prevt,caches[t])
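For concreteness, a minimal sketch of that joined loop (assuming the gradients dictionary keys from the assignment, e.g. "da_prev"; this is a sketch, not the full notebook code):

import numpy as np

da_prevt = np.zeros((n_a, m))                     # nothing flows in from beyond the last time step
for t in reversed(range(T_x)):
    # join the two branches: loss gradient at step t plus everything arriving from steps > t
    gradients = rnn_cell_backward(da[:, :, t] + da_prevt, caches[t])
    da_prevt = gradients["da_prev"]               # becomes the "from the future" term at step t - 1
    # accumulate the parameter gradients (dWax, dWaa, dba) and dx[:, :, t] from `gradients`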
1 Like

Good observation, David.

The time dependence is the whole catch: it is what creates these differences across the time steps.

Yes, I found backprop with recurrence quite difficult to grasp. I created this Figure for a single time step:

3 Likes

You can refer to 9.7. Backpropagation Through Time — Dive into Deep Learning 1.0.3 documentation.
The key point is that the loss function is a sum over all time steps, y_t depends on a_t, and y_{t+1} depends on a_{t+1}, while a_{t+1} also depends on a_t. Thus L depends on a_t through both y_t and a_{t+1}: L = f(y_t, a_{t+1}), y_t = f(a_t), a_{t+1} = f(a_t). Then you apply the chain rule for multivariate functions.
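Writing that dependency structure out with the multivariate chain rule (same subscript notation):

\frac{\partial L}{\partial a_t} = \frac{\partial L}{\partial y_t}\,\frac{\partial y_t}{\partial a_t} + \frac{\partial L}{\partial a_{t+1}}\,\frac{\partial a_{t+1}}{\partial a_t}

The first term is the da[:,:,t] slice and the second is the da_prevt handed back from step t+1.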

1 Like

Thanks David for the helpful intuition. Another way to look at the redundant method:

When I was deriving the backprop from scratch, my intuition was to traverse from every outer layer to the first cell (i.e. dL[T] / da[1], dL[T-1] / da[1], …), and then sum them up.

Each color represents 1 traversal:

Essentially, instead of adding da’s while you traverse (and only traversing once), you traverse Tx times (starting from each y) and then add gradients.

In quasi-code:

# hypothetical variable names / dict keys follow the assignment's rnn_cell_backward conventions
for t2 in reversed(range(Tx)):
    da_prevt = da[:, :, t2]                      # gradient coming in from the softmax output layer at step t2
    for t in reversed(range(t2 + 1)):
        gradients = rnn_cell_backward(da_prevt, caches[t])
        da_prevt = gradients["da_prev"]          # keep traversing back toward the first cell
        # increment all gradients (dWax, dWaa, dba, dx[:,:,t], ...) with the values in `gradients`

I confirmed that the gradients at the end match, and that the runtime scales with Tx^2 (due to the nested for loops) instead of Tx.

Glad I could reconcile my slower (but more intuitive) way with the more efficient method. Hope this is helpful for others.

2 Likes