ReLU/Leaky ReLU do not work for forward propagation in Planar_data_classification_with_one_hidden_layer

Dear mentors,

I remember Prof. Ng mentioning that he almost always uses ReLU as the activation function. So when I worked on the assignment Planar_data_classification_with_one_hidden_layer, I tried replacing the tanh activation with ReLU. But when I ran the nn_model function, the values in the A1 matrix all became NaN after about 50 iterations. I then tried Leaky ReLU and got the same behavior. Apparently, ReLU and Leaky ReLU do not work here. Why does Prof. Ng say he almost always uses ReLU as the activation function?
Here is the forward prop function I used. I changed tanh() to LRelu().

import numpy as np

def Relu(x):
    return np.where(x > 0, x, 0)    # pass positive values through, zero otherwise

def LRelu(x):
    return np.maximum(0.01 * x, x)  # Leaky ReLU with slope 0.01 for negative inputs

{moderator edit - solution code removed}

Thanks in advance!

mliang

It is perfectly possible to make ReLU or Leaky ReLU work in this exercise, but the key point to realize is that the activation function does not only affect the forward propagation, right? You also need to take it into account in the back propagation process.
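Concretely, that means you also need the derivative of whichever activation you pick for the backward pass. A minimal sketch (the function names here are just for illustration, not from the notebook):

import numpy as np

def Relu_derivative(x):
    # derivative of ReLU: 1 where x > 0, 0 elsewhere (the value at x == 0 is a convention)
    return np.where(x > 0, 1.0, 0.0)

def LRelu_derivative(x):
    # derivative of Leaky ReLU with slope 0.01: 1 where x > 0, 0.01 elsewhere
    return np.where(x > 0, 1.0, 0.01)

In the backward step, wherever the tanh derivative appears (as 1 - A1**2 in the tanh version), you would use LRelu_derivative(Z1) instead, since this derivative is evaluated on Z1 rather than A1.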

1 Like

@paulinpaloalto Thanks a lot for the reply! So in this case, I have to work out the backward propagation equations myself, right? Even after watching Prof. Ng’s explanation, I still don’t quite get how those equations are derived. Can you point me to some books or resources that explain the math?

mliang

The fully general formulas are given in the notebook here. Note that g^{[1]}() is the activation function at layer 1. Here’s one of the formulas given in the notebook:

dZ^{[1]} = \left ( W^{[2]T} \cdot dZ^{[2]} \right ) * g^{[1]'}(Z^{[1]})

In that formula g^{[1]'}() is the derivative of the activation function for layer 1, right?
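For example, for Leaky ReLU with slope 0.01, that derivative works out to

g^{[1]'}(z) = \begin{cases} 1, & z > 0 \\ 0.01, & z \le 0 \end{cases}

(what happens exactly at z = 0 is just a convention), and for plain ReLU the 0.01 becomes 0.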

If you want to go deeper on the mathematics of back propagation, here’s a thread with several links to material out on the web.

@paulinpaloalto Hi, I have changed the derivative of the activation function. Is the following equation correct?
dZ1 = np.dot(W2.T, dZ2) * np.where(Z1 > 0, 1, 0.01)
But after I made the change, the accuracy topped out at 71%. It didn’t get any better no matter how many iterations I ran.

That looks correct. Well, maybe you need to try more than 4 neurons in the hidden layer.

I tried various combinations and was able to get the accuracy up to 85% by using n_h = 277, α = 0.073, and 50,000 iterations.
It took a couple of hours to train.
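For anyone who wants to script a similar search, the loop boils down to something like this (a sketch only, assuming the notebook’s nn_model(X, Y, n_h, num_iterations) and predict(parameters, X) helpers; if I remember right, the learning rate is hard-coded inside update_parameters, so changing α means editing that function):

import numpy as np

# rough sweep over the hidden-layer size; nn_model and predict are the notebook's helpers
for n_h in [4, 20, 40, 100, 277]:
    parameters = nn_model(X, Y, n_h, num_iterations=50000)
    predictions = predict(parameters, X)
    # percentage of examples where the prediction matches the label (both are 1 x m)
    accuracy = float(np.dot(Y, predictions.T) + np.dot(1 - Y, 1 - predictions.T)) / Y.size * 100
    print(f"n_h = {n_h}: {accuracy:.1f}% accuracy")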

1 Like

Hi, @Caleb.

Thanks very much for sharing your results here. I also did a few experiments using just plain ReLU (as opposed to Leaky ReLU). I was able to get 81% accuracy with n_h = 40, \alpha = 0.6 and 12k iterations.

It’s interesting that it seems quite a bit easier to get good results using tanh here on this particular problem. The general rule is that there is no one magic recipe for what will work best in any given situation.

3 Likes

@paulinpaloalto thanks for sharing. I tried the same settings, i.e., n_h = 40, α = 0.6, and 12k iterations, and got 85% accuracy, which is not as good as tanh or sigmoid (which can reach 89% accuracy with the default settings).

Hi, @taohaoxiong. Thanks for trying the experiments and sharing your results! I reran my test and I also get 85% accuracy now, using ReLU with the other hyperparameters that you list. I also tried Leaky ReLU with 0.2, 0.1 and 0.05 slopes. I was able to get pretty similar accuracies.

I also ran a few experiments with the original tanh code: with more hidden nodes (n_h = 20 and n_h = 40) it got better accuracy, and with more iterations it went above 91%.

Thanks for the info.
BTW, I tried the other datasets provided at the end of the notebook; the difference between tanh and ReLU is basically not big:

noisy_circles: tanh: 80%, ReLU: slightly lower

noisy_moons: tanh: 97%, ReLU: similar

blobs: tanh: 83%, ReLU: similar

gaussian_quantiles: tanh: 99%, ReLU: 100%

Note: the above are just rough results, since I didn’t adjust the hyperparameters much.
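In case anyone wants to reproduce the comparison, the extra datasets can be run through the same model with roughly this loop (a sketch, assuming the notebook’s load_extra_datasets, nn_model and predict helpers, with whichever activation is wired into the model):

import numpy as np

# the helpers below are assumed to come from the assignment notebook / planar_utils
noisy_circles, noisy_moons, blobs, gaussian_quantiles, no_structure = load_extra_datasets()
datasets = {"noisy_circles": noisy_circles, "noisy_moons": noisy_moons,
            "blobs": blobs, "gaussian_quantiles": gaussian_quantiles}

for name, (X, Y) in datasets.items():
    X, Y = X.T, Y.reshape(1, Y.shape[0])   # reshape to (n_x, m) and (1, m)
    if name == "blobs":
        Y = Y % 2                          # make the multi-class blobs labels binary
    parameters = nn_model(X, Y, n_h=4, num_iterations=10000)
    predictions = predict(parameters, X)
    accuracy = float(np.dot(Y, predictions.T) + np.dot(1 - Y, 1 - predictions.T)) / Y.size * 100
    print(f"{name}: {accuracy:.0f}% accuracy")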