Course1 - Week3 Assignment - ReLU gave worse performance than tanh


I tried the ungraded exercises and implemented ReLU (as well as Leaky ReLU) as the activation function for the hidden layer. On this assignment, ReLU turned out to perform far worse than tanh.

I did a quick Google search and it seems ReLU is recommended over tanh and sigmoid for simple neural networks for computational efficiency.

Has anyone else tried this, and what did you observe? Is ReLU supposed to be inferior to tanh in this particular case?

Thank you!

You should be able to get pretty decent performance from ReLU on this exercise, but you may need to fiddle with the learning rate and number of iterations. It also doesn’t do so well with small numbers of neurons in the hidden layer. I was able to get 81% accuracy with n_h = 40, \alpha = 0.6 and 12k iterations.

One important thing to check is that you did the complete and correct implementation: note that you need to change more than just forward propagation, right? The derivative of the activation functions is part of back propagation, so that needs to change as well.


I guess we should modify the activation function and also g^{[1]'}(Z^{[1]}), but how can we implement its derivative in dZ^{[1]}?

Well, what is the formula for dZ^{[1]}?

dZ^{[1]} = \left ( W^{[2]T} \cdot dZ^{[2]} \right ) * g^{[1]'}(Z^{[1]})
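To make the shapes in that formula concrete, here is a minimal sketch in NumPy. The helper names `hidden_layer_dZ` and `tanh_prime` and the layer sizes are made up for illustration and are not the assignment's own code. The point is only that the first product is a matrix product, the second is elementwise, and the activation derivative is the one piece that changes when you swap activations.

```python
import numpy as np

def hidden_layer_dZ(W2, dZ2, Z1, g_prime):
    """dZ1 = (W2^T dZ2) * g'(Z1): matrix product first, then elementwise product."""
    return np.dot(W2.T, dZ2) * g_prime(Z1)

def tanh_prime(Z):
    """Derivative of tanh, used here only as the familiar baseline case."""
    return 1.0 - np.tanh(Z) ** 2

# Hypothetical sizes: 4 hidden units, 1 output unit, 5 training examples.
rng = np.random.default_rng(0)
n_h, n_y, m = 4, 1, 5
W2 = rng.standard_normal((n_y, n_h))
dZ2 = rng.standard_normal((n_y, m))
Z1 = rng.standard_normal((n_h, m))

dZ1 = hidden_layer_dZ(W2, dZ2, Z1, tanh_prime)
print(dZ1.shape)  # (4, 5), the same shape as Z1
```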

So (as you say), you need the formula for the derivative of ReLU. You have this:

g(Z) = max(0, Z)

That means g(Z) is Z if Z >= 0 and 0 if Z < 0. So what would the derivative of that look like? It would be 0 if Z < 0 and 1 if Z >= 0, right? I can write that in one easy line of Python and NumPy.
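For anyone who wants to check their own version, one possible way to write it is sketched below. The function names `relu` and `relu_derivative` are just illustrative, and the value at exactly Z = 0 (where the derivative is technically undefined) follows the convention used above, which is an arbitrary but common choice.

```python
import numpy as np

def relu(Z):
    """Elementwise max(0, Z)."""
    return np.maximum(0, Z)

def relu_derivative(Z):
    """1 where Z >= 0, 0 where Z < 0 (convention at Z == 0 as described above)."""
    return (Z >= 0).astype(float)

Z = np.array([[-2.0, 0.0, 3.0]])
print(relu(Z))             # [[0. 0. 3.]]
print(relu_derivative(Z))  # [[0. 1. 1.]]
```

In the formula for dZ^{[1]}, the result of `relu_derivative(Z1)` would take the place of g^{[1]'}(Z^{[1]}) in the elementwise product.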