Regarding derivative of ReLU activation function

Hi, Siddhesh.

It’s great that you are doing experiments like this! It’s always a good learning experience when you try to extend the ideas in the course. You’ve discovered something pretty interesting here. I do not have an explanation yet, but my results are the same as yours: I had already implemented this using Z > 0 and got good results, and when I tried >= 0, I also got much worse results. So I think your code is correct, but we’ve got an unexplained phenomenon on our hands.

Of course the underlying issue is that ReLU is not differentiable at Z = 0, so (as Prof Ng comments in the lecture) you can get around that by just using one of the limit values as the derivative at Z = 0. The really surprising thing is that the choice of > versus >= makes such a big difference. <Update: this analysis is wrong. See the later replies on this thread.>
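Just so we’re all talking about the same thing, here’s a minimal NumPy sketch of the two formulations. This is not the course’s actual backprop code; the Z and dA values are made up purely for illustration. The only place the two versions can differ is at entries where Z is exactly 0:

```python
import numpy as np

# Made-up cached pre-activation values from forward prop and an upstream gradient
Z = np.array([[-2.0, 0.0, 3.0]])
dA = np.ones_like(Z)  # pretend gradient flowing back from the next layer

# Formulation 1: derivative taken as 0 at Z == 0 (the "Z > 0" version)
dZ_strict = dA * (Z > 0)

# Formulation 2: derivative taken as 1 at Z == 0 (the "Z >= 0" version)
dZ_inclusive = dA * (Z >= 0)

print(dZ_strict)     # [[0. 0. 1.]]
print(dZ_inclusive)  # [[0. 1. 1.]]
```

In principle the two gradients differ only on the entries where Z lands exactly on 0.0, which is what makes it so surprising that the choice has such a large effect in practice.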

FWIW, here’s an earlier thread about using sigmoid and ReLU for the hidden layer in the Planar Data exercise, but apparently we all used the Z > 0 formulation for the ReLU derivative there.

So the bottom line is that you’ve found something pretty interesting: it appears to disagree with the formula Prof Ng shows in the lectures and needs more thought and investigation. My next step is to check how the ReLU derivative is computed in the Week 4 exercise.