How does ReLU appear in the first-layer gradient of backpropagation?

I’m following Stanford’s Natural Language Processing course on Coursera. I’m learning about the “Continuous bag of words” (CBOW) model, where a neural network with one ReLU layer (first layer) and one softmax layer (second layer) is involved. The gradient of J with respect to W1 somehow goes like this:

∂J/∂W1 = 1/m * ReLU((W2)^T (Yhat - Y)) X^T

But according to the general backpropagation formula (Andrew Ng’s formula), isn’t it supposed to be:

∂J/∂W1 = 1/m * ((W2)^T (Yhat - Y) ⊙ [Z1 > 0]) X^T

where [Z1 > 0] is just the derivative of the first layer’s ReLU and ⊙ is the element-wise product.

My question: How are these two equivalent? I tried so many ways to connect these!
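
To make the comparison concrete, here is how I would write the two expressions in NumPy. The shapes and variable names below are just my reading of the course convention (X is V x m, W1 is N x V, W2 is V x N), not the actual course code:

```python
import numpy as np

# Illustrative shapes only (not from the course):
V, N, m = 10, 4, 5                       # vocab size, hidden size, batch size
rng = np.random.default_rng(0)

X = rng.random((V, m))                   # averaged context word vectors
W1, b1 = rng.standard_normal((N, V)), rng.standard_normal((N, 1))
W2, b2 = rng.standard_normal((V, N)), rng.standard_normal((V, 1))

Z1 = W1 @ X + b1                         # first-layer pre-activation
A1 = np.maximum(0, Z1)                   # ReLU
Z2 = W2 @ A1 + b2
Z2 -= Z2.max(axis=0, keepdims=True)      # stabilize softmax
Yhat = np.exp(Z2) / np.exp(Z2).sum(axis=0, keepdims=True)
Y = np.eye(V)[:, rng.integers(0, V, m)]  # one-hot targets

# Course formula: ReLU applied to the backpropagated error
grad_course = (1 / m) * np.maximum(0, W2.T @ (Yhat - Y)) @ X.T

# General formula: mask with the ReLU derivative [Z1 > 0]
grad_general = (1 / m) * ((W2.T @ (Yhat - Y)) * (Z1 > 0)) @ X.T
```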

Hello @Duc_Gia,
Welcome to the Discourse community, and thanks a lot for this question. In my reply, I will do my best to explain why these two equations are equivalent.

The two equations are equivalent because the ReLU function is piecewise linear. The ReLU function is defined as follows:

ReLU(x) = max(0, x)

The derivative of the ReLU function is:

ReLU'(x) = 1 if x > 0
ReLU'(x) = 0 if x <= 0
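
In NumPy, that derivative is just a 0/1 mask. A minimal check with a made-up vector:

```python
import numpy as np

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])

relu = np.maximum(0, x)            # ReLU(x) = max(0, x)
relu_grad = (x > 0).astype(float)  # ReLU'(x): 1 where x > 0, else 0

print(relu)       # [0.  0.  0.  0.5 2. ]
print(relu_grad)  # [0. 0. 0. 1. 1.]
```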

The first equation for the gradient of J with respect to W1 can be written as follows:

∂J/∂W1 = 1/m * ReLU((W2)^T (Yhat - Y)) X^T

The ReLU function is applied element-wise to the matrix (W2)^T(Yhat - Y): every element greater than 0 is kept, and every element less than or equal to 0 is set to 0. Applying ReLU element-wise to a matrix is therefore the same as multiplying it element-wise by a 0/1 indicator of its positive entries. In the second equation that indicator is written as [Z1 > 0], a matrix of zeros and ones whose ones mark the elements of Z1 that are greater than 0.
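
A quick NumPy check of that element-wise identity (ReLU of a matrix equals the matrix times its own positivity indicator), using an arbitrary example matrix:

```python
import numpy as np

v = np.array([[ 1.5, -0.3],
              [-2.0,  0.7]])     # stands in for (W2)^T(Yhat - Y)

lhs = np.maximum(0, v)           # ReLU applied element-wise
rhs = v * (v > 0)                # element-wise product with the 0/1 indicator

assert np.array_equal(lhs, rhs)  # identical for every element
```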

The second equation for the gradient of J with respect to W1 can be written as follows:

∂J/∂W1 = 1/m * ((W2)^T (Yhat - Y) ⊙ [Z1 > 0]) X^T

The element-wise product of (W2)^T(Yhat - Y) with [Z1 > 0] gives the same result as applying ReLU element-wise to (W2)^T(Yhat - Y), and so the two equations are equivalent.

I hope I was able to help you with my derivation. Please feel free to ask a follow-up question if anything is still unclear.
Best,
Can


But that mask is [(W2)^T(Yhat - Y) > 0], right? Am I understanding correctly? The second equation tells us that (W2)^T(Yhat - Y) is multiplied by [Z1 > 0], not by [(W2)^T(Yhat - Y) > 0]. Your explanation still leaves me confused. Thanks for your time anyway!

Dear @Duc_Gia ,
Yes, you are correct. The matrix [Z1 > 0] is a matrix of zeros and ones, where the ones mark the elements of Z1 that are greater than 0. So the second equation for the gradient, ∂J/∂W1 = 1/m * ((W2)^T (Yhat - Y) ⊙ [Z1 > 0]) X^T, tells us that (W2)^T(Yhat - Y) is multiplied element-wise by [Z1 > 0], not by [(W2)^T(Yhat - Y) > 0]. The explanation in my previous answer may have been confusing on this point, but the correct matrix to multiply element-wise with (W2)^T(Yhat - Y) is indeed [Z1 > 0].
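
For completeness, here is a minimal NumPy sketch of that gradient with the [Z1 > 0] mask (the shapes and function name are illustrative, not the course's starter code):

```python
import numpy as np

def grad_W1(X, Z1, W2, Yhat, Y):
    """dJ/dW1 = 1/m * ((W2)^T (Yhat - Y) ⊙ [Z1 > 0]) X^T"""
    m = X.shape[1]
    dA1 = W2.T @ (Yhat - Y)   # error backpropagated through the second layer
    dZ1 = dA1 * (Z1 > 0)      # apply the ReLU derivative mask built from Z1
    return (1 / m) * dZ1 @ X.T
```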
Regards,
Can