Incorrect Backprop Equations NLP Course 2 Week 4

Hi,

While watching the lecture videos for NLP Course 2, Week 4, I noticed an error in the backprop equations given in the videos. The video is "Training a CBOW Model: Backpropagation and Gradient Descent".

The equations say ReLU(W_2^T (y_hat - y)) X^T.

If I am not mistaken, the correct equation should have ReLU'(z_1) * W_2^T (y_hat - y) X^T, where ReLU'(z_1) is either 0 or 1 depending on the pre-activation z_1 from the forward pass.

I’m pretty sure backprop through ReLU should use ReLU', the derivative of the activation, not ReLU() itself in the backward pass, right?

Or am I misunderstanding the equations?
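For concreteness, here is a tiny numpy sketch of the two versions I am comparing (toy shapes and my own variable names, not the lab's code):

import numpy as np

np.random.seed(0)
N, V, m = 4, 6, 3                        # toy sizes: hidden units, vocab, batch
W2 = np.random.randn(V, N)
z1 = np.random.randn(N, m)               # pre-activation saved from the forward pass
yhat_minus_y = np.random.randn(V, m)     # stand-in for (y_hat - y)

dJ_dH = np.dot(W2.T, yhat_minus_y)       # error propagated back through W2, shape (N, m)

lecture_version = np.maximum(0, dJ_dH)              # ReLU applied to the backward signal
relu_prime_version = np.where(z1 > 0, dJ_dH, 0.0)   # ReLU'(z1) * backward signal

print(np.allclose(lecture_version, relu_prime_version))  # generally False

The two only agree when the negative entries of dJ_dH line up exactly with the positions where z1 <= 0.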

Hey @Carl_Merrigan,
That’s an intriguing question. Quite frankly, I was stumped for some time about where to begin, but let’s start anyway. First, let’s state what we already know:

\frac{\partial{J}}{\partial{H}} = \frac{1}{m} W_2^T(\hat{Y} - Y)
\frac{\partial{Z_1}}{\partial{W_1}} = X
\frac{\partial{J}}{\partial{W_1}} = \frac{\partial{J}}{\partial{H}} \frac{\partial{H}}{\partial{Z_1}} \frac{\partial{Z_1}}{\partial{W_1}}
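Putting these together with the chain rule (a sketch in the same notation, with the elementwise product written explicitly):

\frac{\partial{J}}{\partial{W_1}} = \frac{1}{m} \left( \frac{\partial{H}}{\partial{Z_1}} \odot W_2^T(\hat{Y} - Y) \right) X^T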

Now, the thing in focus is \frac{\partial{H}}{\partial{Z_1}}, where we already know that H = ReLU(Z_1). We know that:

Z_1 > 0; H = Z_1; \frac{\partial{H}}{\partial{Z_1}} = 1
Z_1 <= 0; H = 0; \frac{\partial{H}}{\partial{Z_1}} = 0

Now, the above derivatives are conditioned on Z_1, but with a simple deduction, I can re-write them conditioned on H, as below:

H > 0; \frac{\partial{H}}{\partial{Z_1}} = 1
H = 0; \frac{\partial{H}}{\partial{Z_1}} = 0

Now, as per ReLU’s definition, H can’t be less than 0, so it won’t hurt to change the second condition above from H = 0 to H <= 0. Re-writing the derivatives, we get:

H > 0; \frac{\partial{H}}{\partial{Z_1}} = 1
H <= 0; \frac{\partial{H}}{\partial{Z_1}} = 0
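Just as a quick sanity check (a minimal numpy sketch with toy values, not the lab's code), masking on Z_1 and masking on H pick out exactly the same entries:

import numpy as np

np.random.seed(1)
z1 = np.random.randn(4, 3)            # toy pre-activation
h = np.maximum(0, z1)                 # H = ReLU(Z_1)
upstream = np.random.randn(4, 3)      # some upstream gradient

masked_on_z1 = np.where(z1 > 0, upstream, 0.0)
masked_on_h = np.where(h > 0, upstream, 0.0)

print(np.allclose(masked_on_z1, masked_on_h))  # True: both conditions select the same entries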

Now, if we take a close look at these re-written conditions, we will find that the derivatives are based upon ReLU(H), where H happens to be the output of ReLU. Drawing an analogy with backward propagation, we can say that \frac{\partial{J}}{\partial{H}} plays the role of the output of ReLU and \frac{\partial{Z_1}}{\partial{W_1}} the role of the input, so we can define the derivative \frac{\partial{H}}{\partial{Z_1}} based on ReLU(\frac{\partial{J}}{\partial{H}}).

This is what I guess could be one of the reasons, but I believe there should be an easier and more intuitive explanation. Let me tag in another mentor to shed some light here. Hey @arvyzukai, can you please take a look at this thread and let us know your take on it? Thanks in advance.

Cheers,
Elemento

Hey @Elemento

I had a second look and I think everything is OK, as you explained - we reuse ReLU for the \frac{\partial{h}}{\partial{z_1}} calculation (and this is the part @Carl_Merrigan was concerned about). The local derivative w.r.t. W_1 is just the dot product with X^T, and with the chain rule we get what is in the video.

Maybe @reinoudbosch could confirm that.

Thanks, @Elemento.

I see what you are saying: with H = ReLU(Z_1), dH/dZ_1 can be conditioned on Z_1 > 0 or H > 0.

I still get lost as to why a term ReLU(dJ/dH) would appear, though.

Let’s say the ReLU were replaced with a sigmoid (H = sigmoid(Z_1)); I claim the formula would then be:

dJ/dW_1 = sigmoid'(Z_1) * W_2^T (y_hat - y) X^T = sigmoid(Z_1)(1 - sigmoid(Z_1)) * W_2^T (y_hat - y) X^T, or H(1 - H) * dJ/dH X^T

With H = tanh(Z_1), it would become (1 - H^2) * dJ/dH X^T

With H = Z_1, i.e. a linear activation, it would be just dJ/dH X^T

My point is that in all three of the above, there is never a nonlinear function acting on dJ/dH
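Here is a small numpy sketch of the pattern I have in mind (toy shapes and my own variable names, not the lab's code): swapping the activation only changes the elementwise factor activation'(z_1), and nothing nonlinear ever touches dJ/dH itself.

import numpy as np

np.random.seed(2)
N, V, D, m = 4, 6, 5, 3                      # toy sizes: hidden, vocab, input dim, batch
W2 = np.random.randn(V, N)
z1 = np.random.randn(N, m)                   # pre-activation from the forward pass
x = np.random.randn(D, m)
yhat_minus_y = np.random.randn(V, m)         # stand-in for (y_hat - y)

dJ_dH = np.dot(W2.T, yhat_minus_y)           # identical for every choice of activation

sig = 1.0 / (1.0 + np.exp(-z1))
activation_primes = {
    "sigmoid": sig * (1 - sig),
    "tanh": 1 - np.tanh(z1) ** 2,
    "linear": np.ones_like(z1),
    "relu": (z1 > 0).astype(float),
}
for name, act_prime in activation_primes.items():
    dJ_dW1 = np.dot(act_prime * dJ_dH, x.T) / m   # elementwise activation'(z1) * dJ/dH, then X^T
    print(name, dJ_dW1.shape)                     # always (N, D)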

I went on to the week's lab, which has us use the formulas from the lectures to do the backprop calculation. The gradient descent algorithm works as expected, with the loss going down, which suggests the lecture equations do work in practice.

My only way to rationalize it is that the inputs Z_1 come from averaged one-hot vectors X, which are all non-negative, and from W_1, which is initialized with uniform elements between 0 and 1, so Z_1 is typically positive. Similarly, dJ/dH comes from W_2^T (y_hat - y), which is plausibly positive since the y_hats are softmax probabilities and y is zero except at the index of the true label.

So my suspicion is that the term ReLU(dJ/dH) is just passing the error backwards linearly, and the gradient is effectively acting the same as the gradient of a linear activation in the hidden layer:

ReLU(W_2^T (y_hat - y)) X^T is effectively becoming just W_2^T (y_hat - y) X^T.

When I have some time, I will try to download the lab and test if my hunch is right.
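Roughly, the check I have in mind is something like this (a hypothetical helper, not part of the assignment), dropped into the lab's training loop to see how often ReLU actually clips the backward signal:

import numpy as np

def relu_clip_fraction(W2, yhat, y):
    """Fraction of entries of W2^T (yhat - y) that are negative, i.e. zeroed out by ReLU."""
    l1 = np.matmul(W2.T, yhat - y)
    return np.mean(l1 < 0)

If that fraction stays near zero during training, then ReLU(dJ/dH) really is just dJ/dH.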

Hi Carl_Merrigan,

Great question. It could be rephrased as: why would negative values of \frac{\partial{J}}{\partial{H}} be cut off?
As there does not seem to be a reason why, your argument that it works because \frac{\partial{J}}{\partial{H}} tends to be positive makes sense to me. That this effectively implies a linear activation is consistent with an important implementation by Xin Rong (https://arxiv.org/pdf/1411.2738.pdf).

Hey @Carl_Merrigan,
To be honest, I am a little lost here myself, since I can't find a concrete mathematical derivation behind this equation that I could present to you. What I wrote in my reply is just what I believe could be one of the reasons.

As to W_1 and W_2 being positive, they are only initialized with positive values; there is no guarantee that they will stay positive over the course of training. And as you mentioned, (\hat{y} - y) will be negative at least at the index containing the true label.
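To see that a positive initialization does not keep a weight matrix positive, here is a toy sketch (a plain least-squares example of my own, not the lab's CBOW model):

import numpy as np

np.random.seed(3)
N, D, m = 4, 5, 8
W1 = np.random.rand(N, D)            # initialized with uniform entries in [0, 1): all positive
x = np.random.rand(D, m)
target = np.random.randn(N, m)

lr = 0.5
for _ in range(50):                  # a few plain gradient steps on a toy regression loss
    h = np.dot(W1, x)
    grad = np.dot(h - target, x.T) / m
    W1 -= lr * grad

print("fraction of negative entries in W1 after training:", np.mean(W1 < 0))  # typically > 0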

Let me try one thing. I will modify the back_prop function to use the equations that we believe could replace the current ones, and observe whether the cost still decreases over the iterations. I will report back to this thread as soon as I have the results.

Cheers,
Elemento

Hey @Carl_Merrigan, @arvyzukai and @reinoudbosch,
The results are indeed intriguing, and pretty much the same as the expected output. Let me include both here for your reference.

Received Output (After Modifying the Back-prop Equations)

Call gradient_descent
iters: 10 cost: 11.809219
iters: 20 cost: 3.615004
iters: 30 cost: 9.307969
iters: 40 cost: 1.616883
iters: 50 cost: 9.013010
iters: 60 cost: 10.843635
iters: 70 cost: 6.548513
iters: 80 cost: 10.852283
iters: 90 cost: 9.712245
iters: 100 cost: 11.470117
iters: 110 cost: 4.041459
iters: 120 cost: 4.152883
iters: 130 cost: 10.796167
iters: 140 cost: 5.966528
iters: 150 cost: 2.070231

Expected Output

iters: 10 cost: 11.714748
iters: 20 cost: 3.788280
iters: 30 cost: 9.179923
iters: 40 cost: 1.747809
iters: 50 cost: 8.706968
iters: 60 cost: 10.182652
iters: 70 cost: 7.258762
iters: 80 cost: 10.214489
iters: 90 cost: 9.311061
iters: 100 cost: 10.103939
iters: 110 cost: 5.582018
iters: 120 cost: 4.330974
iters: 130 cost: 9.436612
iters: 140 cost: 6.875775
iters: 150 cost: 2.874090

Modified Backprop Equations

# pre-activation of the hidden layer (from the forward pass)
z1 = np.matmul(W1, x) + b1
# error propagated back through W2
l1 = np.matmul(W2.T, (yhat - y))
# apply ReLU'(z1): zero the entries where the pre-activation was negative
l1[z1 < 0] = 0  # Shape = (N, m)
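For contrast, the graded version applies ReLU to the backward signal itself, in line with the lecture equation ReLU(W_2^T (\hat{Y} - Y)) X^T, presumably something along these lines:

# graded / lecture version (sketch): ReLU acts on the backward error, not gated by z1
l1 = np.matmul(W2.T, (yhat - y))
l1 = np.maximum(0, l1)  # negative entries of the error signal itself are zeroed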

Undoubtedly, the modified equations will fail the grader, but I guess we can now claim that even if we use the equations you proposed, there is no issue per se. I hope this resolves your query.

Cheers,
Elemento


Thanks for running the check! @Elemento

@reinoudbosch, thanks for linking to the paper, I will take a look.

I guess most people taking the course would not notice or be bothered by the equation and just complete the assignment as directed, but it would be nice if some note or correction were added to the slides in the course.

Thanks for the responses, everyone. I will forge ahead with the next courses.

Hey @Carl_Merrigan,
Sure, let me raise an issue regarding this, and if the team deems it fit, they will modify the slides/assignment accordingly.

Cheers,
Elemento


Hey @Carl_Merrigan,
The team has verified your changes and is now working to fix the lecture video as well as the assignment. It might take some time for the changes to show up on Coursera. Once again, thanks a lot for your contributions to the community.

Cheers,
Elemento


Sounds great! Thanks for bringing it to their attention.

Hey @Carl_Merrigan,
The assignment and the lecture videos have been updated. Once again, thanks a lot for your contributions.

Cheers,
Elemento