Regarding the derivative of the ReLU activation function

As per the lecture slide, when defining the derivative of the ReLU function, using “A1 >= 0” does not give good results and the decision boundary looks almost linear, but if I remove the equality sign, i.e. use “A1 > 0”, then performance increases drastically. Is this expected, or am I doing something incorrectly?

Without the “=” sign in the gradient:
dZ1 = np.dot(W2.T, dZ2) * (np.ones((A1.shape)) * (A1 > 0))

[image: decision boundary plot with A1 > 0]

With the “=” sign:
dZ1 = np.dot(W2.T, dZ2) * (np.ones((A1.shape)) * (A1 >= 0))


[image: decision boundary plot with A1 >= 0; the boundary looks almost linear]

Hi, Siddhesh.

It’s great that you are doing experiments like this! It’s always a good learning experience when you try to extend the ideas in the course, and you’ve discovered something pretty interesting here. I do not have an explanation yet, but my results are the same as yours: I had already implemented this using Z > 0 and got good results, and when I tried >= 0 I also got much worse results. So I think your code is correct, but we’ve got an unexplained phenomenon on our hands.

Of course the issue is that ReLU is not differentiable at Z = 0, so (as Prof Ng comments in the lecture) you can get around that by just using one of the limit values as the derivative at Z = 0. The really surprising thing is that the choice of > versus >= makes such a big difference. <Update: this analysis is wrong. See the later replies on this thread.>
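As a side note, here is a minimal sketch (not the course code) that just makes explicit what the two conventions for the derivative value at Z = 0 look like in numpy:

import numpy as np

Z = np.array([-2.0, 0.0, 3.0])
A = np.maximum(0, Z)                     # ReLU itself
g_prime_left  = (Z > 0).astype(float)    # convention: derivative 0 at Z = 0
g_prime_right = (Z >= 0).astype(float)   # convention: derivative 1 at Z = 0
print(g_prime_left)   # [0. 0. 1.]
print(g_prime_right)  # [0. 1. 1.]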

FWIW, here’s an earlier thread about using sigmoid and ReLU for the hidden layer in the Planar Data exercise, but apparently we all used the Z > 0 formulation there for the ReLU derivative.

So the bottom line is that you’ve found something pretty interesting which appears to disagree with the formula Prof Ng shows in the lectures and needs more thought and investigation. Actually my next step is to check how the ReLU derivatives are computed in the Week 4 exercise.

For what it’s worth, in the Week 4 assignments we use ReLU for all the hidden layer activations. The relu_backward function is provided as a utility, and it turns out they do use the Z > 0 method. Well, actually they sort of invert the calculation and set the derivative to 0 for Z <= 0, but that’s equivalent to what we are calling the “greater than 0” solution above.
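For reference, a relu_backward along those lines might look roughly like this (my paraphrase of the idea, not the actual utility code; it assumes the cache holds the pre-activation Z):

import numpy as np

def relu_backward(dA, cache):
    Z = cache
    dZ = np.array(dA, copy=True)  # start from the upstream gradient
    dZ[Z <= 0] = 0                # zero the gradient where Z <= 0 (the "Z > 0" convention)
    return dZ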

So it appears that the code differs from what Prof Ng shows in the lectures. Interesting! Both because it’s different and because such an apparently subtle choice has such a big effect on actual convergence.

Oh, wait. My analysis above is wrong. It turns out that your code is wrong. Notice that what you wrote is not equivalent to what Prof Ng wrote. He wrote:

g'(Z) = 1 if Z >= 0

But you wrote the equivalent of:

g'(Z) = 1 if A >= 0

But A is the output of ReLU, right? So all values of A are 0 or greater. So your formulation is equivalent to saying:

g'(Z) = 1 for all Z

That is why it doesn’t work correctly. You have to rewrite it in terms of Z as the input variable. If you rewrite things that way, then you find that either Z > 0 or Z >= 0 gives the same results.
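To make that concrete, here is a small sketch with hypothetical shapes (the variable names follow your code above) showing both why the A1 mask fails and why the two Z1 comparisons agree:

import numpy as np

np.random.seed(0)
W2  = np.random.randn(1, 4)              # hypothetical layer sizes, just for illustration
dZ2 = np.random.randn(1, 5)
Z1  = np.random.randn(4, 5)
A1  = np.maximum(0, Z1)

print((A1 >= 0).all())                   # True: the mask is all ones, so nothing gets zeroed

dZ1_gt  = np.dot(W2.T, dZ2) * (Z1 > 0)   # mask in terms of the input Z1
dZ1_geq = np.dot(W2.T, dZ2) * (Z1 >= 0)  # differs only where Z1 is exactly 0
print(np.allclose(dZ1_gt, dZ1_geq))      # True for random floats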

Whew! It just didn’t feel right that it made that big a difference, so it’s nice to finally have an explanation.


Hi,

Thanks a lot.
It’s clear now, and I understand why the code only works with A > 0. I had forgotten that A is the output of ReLU. As good practice, it’s better to write the ReLU derivative in terms of Z in both cases.