Why apply relu on L1 (L1 = np.dot(W2.T, (yhat - y))) instead of on the activations?

What I understand is that the error on a layer is:

E = np.dot(weights_to_next_layer, error_on_next_layer) * activation_grad(current_layer)

but in the week 4 assignment of the Probabilistic Models course, we apply relu to the dot product itself.
I understand that the gradient of relu is either 0 or 1, but that 0-or-1 mask should come from the hidden layer's activations in this case, not from the back-propagated term. The two versions I am comparing are sketched below.
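To make the comparison concrete, here is a tiny numpy sketch of the two formulations. The shapes, variable names, and the omission of b2 are made up just for illustration; this is not the assignment's actual code.

```python
import numpy as np

np.random.seed(0)
V, N, m = 10, 4, 1                      # vocab size, hidden size, batch size (made up)
W1 = np.random.randn(N, V)
W2 = np.random.randn(V, N)
b1 = np.random.randn(N, 1)
x = np.random.rand(V, m)                # stand-in for the averaged context vector
y = np.eye(V)[:, [3]]                   # one-hot target

relu = lambda z: np.maximum(0, z)

# forward pass
z1 = np.dot(W1, x) + b1
h = relu(z1)
z2 = np.dot(W2, h)                      # b2 omitted for brevity
yhat = np.exp(z2) / np.sum(np.exp(z2), axis=0)   # softmax

# (a) textbook rule: mask with the relu gradient of the hidden pre-activation
l1_textbook = np.dot(W2.T, (yhat - y)) * (z1 > 0)

# (b) assignment-style (as I read it): relu applied to the back-propagated term itself
l1_assignment = relu(np.dot(W2.T, (yhat - y)))

print(l1_textbook.ravel())
print(l1_assignment.ravel())
```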

Any thoughts?

Hi Rounak_Shrestha,

This results from working through the backpropagation for the CBOW model. You can, for instance, find such calculations here or here.
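For anyone skimming the thread, here is a rough sketch of the kind of backward pass being discussed, written the way the question describes it (relu applied to the back-propagated term). The assumed forward pass h = relu(W1 x + b1), yhat = softmax(W2 h + b2), the function name, and the batch-averaging convention are my assumptions, not the assignment's exact code.

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def back_prop_sketch(x, yhat, y, h, W2, batch_size):
    """Sketch of a CBOW-style backward pass, assuming the forward pass
    h = relu(W1 x + b1), yhat = softmax(W2 h + b2)."""
    # error sent back to the hidden layer, with relu applied to it directly
    l1 = relu(np.dot(W2.T, (yhat - y)))
    grad_W1 = np.dot(l1, x.T) / batch_size
    grad_W2 = np.dot(yhat - y, h.T) / batch_size
    grad_b1 = np.sum(l1, axis=1, keepdims=True) / batch_size
    grad_b2 = np.sum(yhat - y, axis=1, keepdims=True) / batch_size
    return grad_W1, grad_W2, grad_b1, grad_b2
```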