I remember Prof. Ng mentioning that he almost always uses ReLU as the activation function. So when I worked on the assignment Planar_data_classification_with_one_hidden_layer, I tried replacing the sigmoid function with ReLU. But when I ran the nn_model function, the values in the A1 matrix all became NaN after 50 iterations. Then I tried Leaky ReLU (LRelu) and got the same result. Apparently, ReLU and Leaky ReLU do not work here. Why did Prof. Ng say he almost always uses ReLU as the activation function?
Here is the forward prop function I used. I changed tanh() to LRelu().

It is perfectly possible to make ReLU or Leaky ReLU work in this exercise, but the key point to realize is that the activation function does not only affect the forward propagation, right? You also need to take it into account in the back propagation process.
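To make that point concrete, here is a minimal sketch, assuming the backward pass still uses the tanh derivative (1 - A1^2) after the forward pass was switched to Leaky ReLU. The function names and the sample values are just for illustration and are not from the notebook. For tanh the activations stay in (-1, 1), so 1 - A1^2 is always between 0 and 1, but ReLU outputs are unbounded, so the same expression can become large and negative, which is one plausible way for the updates to diverge to NaN.

```python
import numpy as np

def leaky_relu(Z, slope=0.01):
    # Forward activation: Z where Z > 0, slope * Z elsewhere.
    return np.where(Z > 0, Z, slope * Z)

def leaky_relu_grad(Z, slope=0.01):
    # Derivative of Leaky ReLU with respect to Z (elementwise).
    return np.where(Z > 0, 1.0, slope)

def tanh_grad_from_activation(A):
    # Derivative of tanh expressed via its output A = tanh(Z).
    # Only valid when A really came from tanh.
    return 1.0 - np.power(A, 2)

# Hypothetical pre-activations for one hidden unit on three examples.
Z1 = np.array([[-2.0, 0.5, 3.0]])
A1 = leaky_relu(Z1)

print("correct factor:   ", leaky_relu_grad(Z1))
print("tanh factor on A1:", tanh_grad_from_activation(A1))
```

The last column is the telling one: the correct factor is 1, while the leftover tanh expression gives 1 - 3^2 = -8.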

@paulinpaloalto Thanks a lot for the reply! So for this case, I have to work out the backward propagation equations myself, right? I still don’t quite get how those equations are derived, even after watching Prof. Ng’s explanation. Can you point me to some books or resources that explain the computation?

The fully general formulas are given in the notebook here. Note that g^{[1]}() is the activation function at layer 1. Here’s one of the formulas given in the notebook:
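For reference, the standard layer-1 step of backprop for a network like this one can be written in that general form as below. This is my reconstruction under the usual course notation, where * is the elementwise product, so treat it as a sketch rather than a quote from the notebook. The second and third lines show what g^{[1]\prime} becomes for tanh and for Leaky ReLU with a slope of 0.01.

```latex
dZ^{[1]} = W^{[2]T} \, dZ^{[2]} * g^{[1]\prime}\left(Z^{[1]}\right)

% tanh: the derivative can be written using the activations themselves
g^{[1]\prime}\left(Z^{[1]}\right) = 1 - \left(A^{[1]}\right)^{2}

% Leaky ReLU with slope 0.01
g^{[1]\prime}(z) =
\begin{cases}
1 & \text{if } z > 0 \\
0.01 & \text{otherwise}
\end{cases}
```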

@paulinpaloalto Hi, I have changed the derivative of the activation function to the following. Is this correct?
dZ1 = np.dot(W2.T, dZ2) * np.where(Z1 > 0, 1, 0.01)
But after I made the change, the accuracy reached 71% at most. It didn’t get any better no matter how many iterations I ran.
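One way to check a formula like this independently of the final accuracy is a numerical gradient check. The following is a minimal sketch, assuming a 2-4-1 network with a Leaky ReLU hidden layer, a sigmoid output and cross-entropy cost. The helper names (forward, cost, grad_W1) and the random data are made up for the check and are not the notebook's functions. It compares the analytic dW1 built from the dZ1 line above with a centered finite-difference estimate.

```python
import numpy as np

def forward(X, W1, b1, W2, b2, slope=0.01):
    # Hidden layer with Leaky ReLU, output layer with sigmoid.
    Z1 = np.dot(W1, X) + b1
    A1 = np.where(Z1 > 0, Z1, slope * Z1)
    Z2 = np.dot(W2, A1) + b2
    A2 = 1.0 / (1.0 + np.exp(-Z2))
    return Z1, A1, A2

def cost(A2, Y):
    # Cross-entropy cost averaged over the m examples.
    m = Y.shape[1]
    return float(-np.sum(Y * np.log(A2) + (1 - Y) * np.log(1 - A2)) / m)

def grad_W1(X, Y, W1, b1, W2, b2, slope=0.01):
    # Analytic gradient of the cost with respect to W1.
    m = X.shape[1]
    Z1, A1, A2 = forward(X, W1, b1, W2, b2, slope)
    dZ2 = A2 - Y
    dZ1 = np.dot(W2.T, dZ2) * np.where(Z1 > 0, 1, slope)
    return np.dot(dZ1, X.T) / m

# Small random problem (hypothetical data, fixed seed for repeatability).
rng = np.random.default_rng(0)
X = rng.standard_normal((2, 5))
Y = (rng.random((1, 5)) > 0.5).astype(float)
W1 = rng.standard_normal((4, 2))
b1 = np.zeros((4, 1))
W2 = rng.standard_normal((1, 4))
b2 = np.zeros((1, 1))

analytic = grad_W1(X, Y, W1, b1, W2, b2)

# Centered finite differences, one entry of W1 at a time.
eps = 1e-6
numeric = np.zeros_like(W1)
for i in range(W1.shape[0]):
    for j in range(W1.shape[1]):
        W_plus = W1.copy()
        W_plus[i, j] += eps
        W_minus = W1.copy()
        W_minus[i, j] -= eps
        J_plus = cost(forward(X, W_plus, b1, W2, b2)[2], Y)
        J_minus = cost(forward(X, W_minus, b1, W2, b2)[2], Y)
        numeric[i, j] = (J_plus - J_minus) / (2 * eps)

max_diff = np.max(np.abs(analytic - numeric))
print("max abs difference:", max_diff)
```

If the two agree to within a small tolerance, the derivative itself is fine, and a low accuracy is more likely down to hyperparameters such as the learning rate, n_h or the number of iterations.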

I tried various combinations and was able to get the accuracy up to 85% by using n_h=277, α=0.073 and 50000 iterations.
It took a couple of hours to train.

Thanks very much for sharing your results here. I also did a few experiments using just plain ReLU (as opposed to Leaky ReLU). I was able to get 81% accuracy with n_h = 40, \alpha = 0.6 and 12k iterations.

It’s interesting that it seems quite a bit easier to get good results using tanh here on this particular problem. The general rule is that there is no one magic recipe for what will work best in any given situation.

@paulinpaloalto Thanks for sharing. I tried the same settings, i.e., n_h = 40, α = 0.6 and 12k iterations, and got 85% accuracy, which is not as good as tanh or sigmoid (which can get 89% accuracy with the default settings).

Hi, @taohaoxiong. Thanks for trying the experiments and sharing your results! I reran my test and I also get 85% accuracy now, using ReLU with the other hyperparameters that you list. I also tried Leaky ReLU with 0.2, 0.1 and 0.05 slopes. I was able to get pretty similar accuracies.

I also tried a few experiments with the original tanh code. I got better accuracy with the larger numbers of hidden nodes (n = 20 and n = 40), and was able to get above the 91% accuracy by running more iterations.