Week 1 Assignment 1 Backpropagation

Why do we need to add da_prevt to the upstream gradient, i.e. da_next = da[:,:,t] + da_prevt rather than just da[:,:,t]? I don't understand the math here. This is for the rnn_backward function. Why is da_prevt being added to the upstream gradient da[:,:,t]? And what does "upstream gradient" actually mean in this case, since we are not computing the derivative from y at time step t ourselves?

gradients = rnn_cell_backward(da[:,:,t]+da_prevt,caches[t])

Do you still have a doubt about this issue?

Yes, though I completed the assignment, I still have this question on the theory.

I also struggle with understanding this; could you please explain why it should be done like this?

This is my personal idea, though I'm not sure. Consider this forward pass: \hat{y}^{\langle t \rangle} = \mathrm{softmax}(W_{ya} a^{\langle t \rangle} + b_y). That is where the actual da[:,:,t] comes from. The da_prevt comes from the block processed just before in the backward pass, i.e. time step t+1.
Anyway, I got a broadcast problem with dba when I initialized it with shape (n_a, 1):
non-broadcastable output operand with shape (5,1) doesn't match the broadcast shape (5,5)
I can fix this by initializing dba to be (n_a,), but that seems fragile.

Hi, I have a doubt about this issue. Please help.

Can you give a little more information about your doubt?

My question is: why are we adding da_prevt to da[:,:,t] when passing it to rnn_cell_backward(da[:,:,t]+da_prevt, caches[t])? Why not just pass da[:,:,t], like this: rnn_cell_backward(da[:,:,t], caches[t])?

Remember that you still have a da contribution from the step where a_next is used to calculate y_hat (shown in the figure in the notebook).

And the idea for the " + " here is that a_next is used to calculate both y_hat and the next block, so when computing the derivative we have to take both of these paths into account.

Does this explanation clarify the matter?

Try setting keepdims=True in np.sum() in rnn_cell_backward().
I had the same issue before.
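
To illustrate the shape issue (a minimal standalone example, not the assignment code):

import numpy as np

dz = np.random.randn(5, 10)                    # (n_a, m) gradient of the pre-activation
dba_flat = np.sum(dz, axis=1)                  # shape (5,)   -> broadcasts badly against a (5, 1) array
dba_col = np.sum(dz, axis=1, keepdims=True)    # shape (5, 1) -> matches the column shape of ba
print(dba_flat.shape, dba_col.shape)           # (5,) (5, 1)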

After some thinking, this is my understanding: keep in mind that we are differentiating the sum of the losses across time, and the t-th loss is affected only by the first t iterations, not by the iterations from t+1 onwards.
My take is that, at step t, da_prevt accounts for the derivatives of all the losses computed from t+1 onwards, whereas da[:,:,t] accounts for the derivative of the single loss at time t (the one we do not actually compute here; it is given to us).
Since the loss is a sum of the individual losses, its differential is also the sum of the differentials of the individual losses.
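
In symbols (my own restatement of that point): since \mathcal{L} = \sum_{t} \mathcal{L}^{\langle t \rangle}, we have \frac{\partial \mathcal{L}}{\partial \theta} = \sum_{t} \frac{\partial \mathcal{L}^{\langle t \rangle}}{\partial \theta} for any parameter or activation \theta, which is why the per-time-step contributions can simply be added.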

I think Antonio is correct here, as shown in the graph from the notebook, where the partial derivative with respect to a_next is affected by two things: the loss function at time step t and the backprop from time steps t+1 onwards.

The notebook writes in the 3rd paragraph in section ‘3 - Backpropagation in Recurrent neural Networks (OPTIONAL/UNGRADED)’:

Note that this notebook does not implement the backward path from the Loss ‘J’ backwards to ‘a’. This would have included the dense layer and softmax, which are a part of the forward path. This is assumed to be calculated elsewhere and the result passed to rnn_backward in da.

This da, shown in the pink box in the figure, is for the path starting from \hat{y}^{\langle t \rangle} (or the loss \mathcal{L}^{\langle t \rangle}(\hat{y}^{\langle t \rangle}, y^{\langle t \rangle}) there) down to a^{\langle t \rangle}. So we are using the slice for time t here: da[:,:,t]. The total da should be the sum of this da and the part passed back from time step t+1 (that is, the output of the rnn_cell_backward call at time t+1). So they are added here.

Why do we add these two? Let's say that we want to compute \frac{\partial J}{\partial a}, and that there are two variables p and q through which a affects J. Then the chain rule says to add the two components: \frac{\partial J}{\partial a} = \frac{\partial J}{\partial p} \frac{\partial p}{\partial a} + \frac{\partial J}{\partial q} \frac{\partial q}{\partial a}. For our example, one of them is for the loss at time t and the other is for the part from time step t+1.
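
Here is a tiny numeric check of that sum rule (my own toy example with scalars standing in for the tensors, not the assignment code):

import numpy as np

# Toy setup: a<t> feeds both y<t> (the loss at t) and a<t+1> (hence the later losses),
# so dJ/da<t> is the sum of the gradients along the two paths.
a_t = 0.7
Wya, Waa = 1.3, 0.5

def J(a):
    y_t = Wya * a                  # contribution to the loss at time t
    a_next = np.tanh(Waa * a)      # hidden state passed on to time t+1
    y_next = Wya * a_next          # contribution to the loss at time t+1
    return y_t + y_next

grad_path_t = Wya                                          # path through y<t>
grad_path_t1 = Wya * (1 - np.tanh(Waa * a_t) ** 2) * Waa   # path through a<t+1>
eps = 1e-6
grad_numeric = (J(a_t + eps) - J(a_t - eps)) / (2 * eps)
print(grad_path_t + grad_path_t1, grad_numeric)            # the two values agree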

I really like the question, since it points to the part of the exercise I found most interesting: not backpropagating through a computation tree, but through a (directed) computation graph. My mental model is this, with the derivative of a sum being the sum of the derivatives (*):

  1. Since \mathcal{L} is the sum of all \mathcal{L}^{<t>}, by (*) we can compute the derivatives for each \mathcal{L}^{<t>} and sum up the results.
  2. We could compute the derivatives for each \mathcal{L}^{<t>} individually, which would correspond to not using rnn_cell_backward(da[:,:,t]+da_prevt,caches[t]), but rnn_cell_backward(da[:,:,t],caches[t]) + rnn_cell_backward(da_prevt,caches[t]). But that would mean we unwind the computation graph into a computation tree and compute the joint steps backward through time repeatedly (leading to O(c^2) instead of O(c) backprop steps, with c being the number of elements in the computation graph).
  3. We again use (*) (but the other way around): we can add up the elements that share the joint steps backward through time, because the derivatives of each \mathcal{L}^{<t>} would need to be summed up at the end anyway. So using rnn_cell_backward(da[:,:,t]+da_prevt,caches[t]) leads to only O(c) backprop steps, and what comes out at the end is already the sum (e.g. the dWaa that comes out is already \partial\mathcal{L}/\partial W_{aa}); a quick numeric illustration of why the joined call gives the same result appears after the summary below.

In summary, going backward through time during backprop, we use

  • (*) from left to right: branch out from d\mathcal{L} to T_y derivatives d\mathcal{L}^{<t>}
  • (*) from right to left: join branches via rnn_cell_backward(da[:,:,t]+da_prevt,caches[t]) instead of using rnn_cell_backward(da[:,:,t],caches[t]) + rnn_cell_backward(da_prevt,caches[t])
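
Here is that quick numeric illustration (a toy stand-in of my own for rnn_cell_backward; the point is just that any backward map is linear in its upstream gradient):

import numpy as np

# Toy stand-in for rnn_cell_backward: a backward map is linear in the upstream gradient,
# so backprop(da1 + da2) equals backprop(da1) + backprop(da2).
def toy_cell_backward(da_upstream, cache):
    Waa, a_prev = cache
    da_prev = Waa.T @ da_upstream      # gradient passed to the earlier time step
    dWaa = da_upstream @ a_prev.T      # parameter gradient
    return da_prev, dWaa

rng = np.random.default_rng(0)
cache = (rng.standard_normal((4, 4)), rng.standard_normal((4, 3)))
da1 = rng.standard_normal((4, 3))
da2 = rng.standard_normal((4, 3))

joined = toy_cell_backward(da1 + da2, cache)
split = tuple(a + b for a, b in zip(toy_cell_backward(da1, cache), toy_cell_backward(da2, cache)))
print(np.allclose(joined[0], split[0]), np.allclose(joined[1], split[1]))   # True True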

Good observation, David.

The time step is the crux here: it is what creates the dependencies across time.

Yes, I found backprop with recurrence quite difficult to grasp. I created this Figure for a single time step:

You can refer to 9.7. Backpropagation Through Time — Dive into Deep Learning 1.0.3 documentation.
The key point is that the loss function is the sum over all time steps, y_t depends on a_t, and y_{t+1} depends on a_{t+1}, while a_{t+1} also depends on a_t. Thus L depends on a_t through both y_t and a_{t+1}: L = f(y_t, a_{t+1}), y_t = g(a_t), a_{t+1} = h(a_t). Then you apply the chain rule for multivariate functions.
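
Written out in that notation (my own rendering of the chain rule being applied here): \frac{\partial L}{\partial a_t} = \frac{\partial L}{\partial y_t} \frac{\partial y_t}{\partial a_t} + \frac{\partial L}{\partial a_{t+1}} \frac{\partial a_{t+1}}{\partial a_t}. The first term is what the da[:,:,t] slice carries, and the second is what arrives as da_prevt.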

Thanks David for the helpful intuition. Another way to look at the redundant method:

When I was deriving the backprop from scratch, my intuition was to traverse from every outer layer to the first cell (i.e. dL[T] / da[1], dL[T-1] / da[1], …), and then sum them up.

Each color represents 1 traversal:

Essentially, instead of adding da’s while you traverse (and only traversing once), you traverse Tx times (starting from each y) and then add gradients.

In quasi-code:

for t2 in reversed(range(Tx)):            # start a fresh traversal at each output y<t2>
    da_prevt = da[:, :, t2]               # gradient from the softmax / output layer at t2
    for t in reversed(range(t2 + 1)):     # walk back from t2 to the first cell
        gradients = rnn_cell_backward(da_prevt, caches[t])
        da_prevt = gradients["da_prev"]   # feed the returned da_prev into the earlier cell
        # increment all parameter gradients (dWax, dWaa, dba, dx) with the values in gradients

I confirmed that the gradients at the end match, and that the runtime scales as Tx^2 (due to the nested for loops) instead of Tx.

Glad I could reconcile my slower (but more intuitive) way with the more efficient method. Hope this is helpful for others.

I found the setup for Exercise 5 to be confusing…and ironically backwards. The quantity in the instructions called dtanh is actually dz. The computations should be done in reverse order. I would suggest that the code in the exercise be written as:

#Forward prop
#za = np.matmul(Waa,a_prev)
#zx = np.matmul(Wax,xt)
#z = za + zx + ba
#a_next = np.tanh(z)

#Back prop
#a_next = np.tanh(z)
dz = None

#z = za + zx + ba
dza = None
dzx = None
dba = None

#zx = np.matmul(Wax,xt)
dWax = None
dxt = None

#za = np.matmul(Waa,a_prev)
dWaa = None
da_prev = None
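
For reference, here is one way those placeholders can work out for the vanilla tanh RNN cell (my own derivation following the skeleton above, not the notebook's official solution; da_next, a_next, a_prev, xt, Wax, and Waa are assumed to be unpacked from the function arguments and the cache):

#Back prop (one possible set of answers)
#a_next = np.tanh(z), so the local derivative is (1 - a_next**2)
dz = da_next * (1 - a_next ** 2)

#z = za + zx + ba: a sum simply routes the gradient to each addend
dza = dz
dzx = dz
dba = np.sum(dz, axis=1, keepdims=True)   # sum over the batch dimension, keeping shape (n_a, 1)

#zx = np.matmul(Wax,xt)
dWax = np.matmul(dzx, xt.T)
dxt = np.matmul(Wax.T, dzx)

#za = np.matmul(Waa,a_prev)
dWaa = np.matmul(dza, a_prev.T)
da_prev = np.matmul(Waa.T, dza)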

Inspired by Andrej Karpathy's tutorial Building makemore Part 4: Become a Backprop Ninja, here is the "backprop ninja" way to set up Exercise 7. Note: some of the forward prop section is uncommented so that the preactivation z variables are defined.

#Forward prop
#concat = np.concatenate((a_prev,xt),axis=0)
zf = np.matmul(Wf,concat) + bf
#ft = sigmoid(zf)
zi = np.matmul(Wi,concat) + bi
#it = sigmoid(zi)
zc = np.matmul(Wc,concat) + bc
#cct = np.tanh(zc)
#c_next = ft * c_prev + it * cct
zo = np.matmul(Wo,concat) + bo
#ot = sigmoid(zo)
#a_next = ot * np.tanh(c_next)
            
# *** Begin backprop ninja training here ***
#Forward #a_next = ot * np.tanh(c_next)
dot = None
dc_next += None #Use += to add result to existing dc_next value 

#Forward #ot = sigmoid(zo)
dzo = None

#Forward #zo = np.matmul(Wo,concat) + bo  
dWo = None
dconcat = None
dbo = None
    
#Forward #c_next = ft * c_prev + it * cct
dft = None
dc_prev = None
dit = None
dcct = None

#Forward #cct = np.tanh(zc)
dzc = None

#Forward #zc = np.matmul(Wc,concat) + bc
dWc = None
dconcat += None #Use += to add result to existing dconcat value
dbc = None

#Forward #it = sigmoid(zi)
dzi = None

#Forward #zi = np.matmul(Wi,concat) + bi
dWi = None
dconcat += None #Use += to add result to existing dconcat value
dbi = None

#Forward #ft = sigmoid(zf)
dzf = None

#Forward #zf = np.matmul(Wf,concat) + bf
dWf = None
dconcat += None #Use += to add result to existing dconcat value
dbf = None

#Forward #concat = np.concatenate((a_prev,xt),axis=0)
da_prev = None
dxt = None
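
And for completeness, a sketch of how those placeholders can be filled in (again my own derivation and not the notebook's official solution; da_next, dc_next, the gate activations, the weights, and n_a are assumed to be unpacked from the function arguments and the cache):

# *** One possible set of answers ***
#Forward #a_next = ot * np.tanh(c_next)
dot = da_next * np.tanh(c_next)
dc_next += da_next * ot * (1 - np.tanh(c_next) ** 2)   # added to the dc_next flowing in from t+1

#Forward #ot = sigmoid(zo)
dzo = dot * ot * (1 - ot)

#Forward #zo = np.matmul(Wo,concat) + bo
dWo = np.matmul(dzo, concat.T)
dconcat = np.matmul(Wo.T, dzo)
dbo = np.sum(dzo, axis=1, keepdims=True)

#Forward #c_next = ft * c_prev + it * cct
dft = dc_next * c_prev
dc_prev = dc_next * ft
dit = dc_next * cct
dcct = dc_next * it

#Forward #cct = np.tanh(zc)
dzc = dcct * (1 - cct ** 2)

#Forward #zc = np.matmul(Wc,concat) + bc
dWc = np.matmul(dzc, concat.T)
dconcat += np.matmul(Wc.T, dzc)
dbc = np.sum(dzc, axis=1, keepdims=True)

#Forward #it = sigmoid(zi)
dzi = dit * it * (1 - it)

#Forward #zi = np.matmul(Wi,concat) + bi
dWi = np.matmul(dzi, concat.T)
dconcat += np.matmul(Wi.T, dzi)
dbi = np.sum(dzi, axis=1, keepdims=True)

#Forward #ft = sigmoid(zf)
dzf = dft * ft * (1 - ft)

#Forward #zf = np.matmul(Wf,concat) + bf
dWf = np.matmul(dzf, concat.T)
dconcat += np.matmul(Wf.T, dzf)
dbf = np.sum(dzf, axis=1, keepdims=True)

#Forward #concat = np.concatenate((a_prev,xt),axis=0)
da_prev = dconcat[:n_a, :]
dxt = dconcat[n_a:, :]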