While experimenting with different activation functions for the hidden layer, I found that ReLU gives accuracy similar to tanh on some datasets. On the main dataset, however, it gives far worse accuracy, usually around 70%, or around 80% with hidden_layer_size = 30, even when I tune learning_rate and the number of iterations. Is there a specific reason why it performs worse than tanh in this scenario?

As the implementation of ReLU I use (X * (X > 0)), and for its derivative (1. * (X > 0)), where X is a matrix.
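For reference, here is a minimal NumPy sketch of the two expressions quoted above, wrapped in functions. The names relu and relu_derivative and the sample matrix are illustrative, not taken from the assignment.

```python
import numpy as np

def relu(X):
    # Element-wise max(0, x); matches X * (X > 0) for finite inputs.
    return np.maximum(X, 0.0)

def relu_derivative(X):
    # 1 where X > 0, else 0 (the derivative at exactly 0 is taken to be 0).
    return (X > 0).astype(float)

# Illustrative input, not from the assignment.
X = np.array([[-2.0, 0.0, 3.0],
              [0.5, -0.1, 1.0]])
print(relu(X))
print(relu_derivative(X))
```

Both forms agree on finite inputs, so the choice between them is mostly a matter of readability.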

It’s great that you are trying experiments like this. You always learn something when you try to extend the ideas in the course. Here’s another thread related to this topic from a while ago. I was able to get 81% accuracy using ReLU with n_h = 40 and some other folks were able to get 85% accuracy with ReLU.

Your implementations of ReLU and ReLU’ look correct to me, but maybe you need a higher n_h value. Note that n_h = 4, which works pretty well with tanh, gives really terrible results with ReLU.

Thanks. With n_h = 40, learning_rate = 0.655 and iterations = 12k, I got 86% accuracy, which is still far from the tanh performance on this dataset. Is there any concrete explanation for why it underperforms here?

I have done some experiments with this dataset and the same architecture (except for using different numbers of neurons and different activations for the hidden layer). I also tried different seeds. To save myself some work (because I am lazy), I implemented my experiment with TensorFlow Keras instead of modifying the assignment.

Hope this can be another starting point for you to further explore about neural networks.
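For readers who would rather stay in NumPy than switch to Keras, the kind of experiment described above (same one-hidden-layer architecture, varying the hidden activation, the number of neurons, and the seed) could be sketched roughly as follows. This is not the code used for the results in this post; the function name train, its defaults, and the (features, examples) data layout are assumptions made for illustration.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def relu_grad(z):
    return (z > 0).astype(float)

def tanh_grad(z):
    return 1.0 - np.tanh(z) ** 2

# Hidden activation and its derivative, selectable by name.
ACTIVATIONS = {"relu": (relu, relu_grad), "tanh": (np.tanh, tanh_grad)}

def train(X, Y, n_h, activation, seed,
          learning_rate=0.5, iterations=1000, init_std=0.01):
    """Gradient descent on a 1-hidden-layer binary classifier.

    X has shape (n_x, m), Y has shape (1, m) with 0/1 labels.
    Returns the learned parameters and the training accuracy.
    """
    g, g_prime = ACTIVATIONS[activation]
    rng = np.random.default_rng(seed)
    n_x, m = X.shape
    W1 = rng.normal(0.0, init_std, size=(n_h, n_x))
    b1 = np.zeros((n_h, 1))
    W2 = rng.normal(0.0, init_std, size=(1, n_h))
    b2 = np.zeros((1, 1))

    for _ in range(iterations):
        # Forward pass: hidden activation g, sigmoid output.
        Z1 = W1 @ X + b1
        A1 = g(Z1)
        A2 = 1.0 / (1.0 + np.exp(-(W2 @ A1 + b2)))

        # Backward pass for the cross-entropy loss.
        dZ2 = A2 - Y
        dW2 = dZ2 @ A1.T / m
        db2 = dZ2.sum(axis=1, keepdims=True) / m
        dZ1 = (W2.T @ dZ2) * g_prime(Z1)
        dW1 = dZ1 @ X.T / m
        db1 = dZ1.sum(axis=1, keepdims=True) / m

        W1 -= learning_rate * dW1
        b1 -= learning_rate * db1
        W2 -= learning_rate * dW2
        b2 -= learning_rate * db2

    A2 = 1.0 / (1.0 + np.exp(-(W2 @ g(W1 @ X + b1) + b2)))
    accuracy = float(np.mean((A2 > 0.5) == (Y > 0.5)))
    return {"W1": W1, "b1": b1, "W2": W2, "b2": b2}, accuracy
```

Calling train in a loop over activation in ("relu", "tanh"), several n_h values, and a few seeds gives the same kind of comparison grid, although the exact numbers will depend on the data and hyperparameters.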

Setting:
learning rate = 0.04
weight initializer = tf.keras.initializers.RandomNormal(mean=0.0, stddev=0.01)
number of iterations = 150000

| Activation | num of neurons | seed=0 | seed=1 | seed=2 |
| --- | --- | --- | --- | --- |
| ReLU | 16 | 0.84 | 0.8275 | 0.8175 |
| ReLU | 64 | 0.885 (figure 1) | 0.87 | 0.875 |
| Concatenated ReLU | 16 | 0.85 | 0.865 | 0.855 |
| Concatenated ReLU | 64 | 0.8975 (figure 2) | 0.8825 | 0.885 |
| tanh | 4 | 0.905 | 0.9075 | |

Explanation of Concatenated ReLU: it concatenates ReLU(x) and ReLU(-x), so it effectively doubles the number of neurons listed in the table.
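To make the doubling concrete, here is a small NumPy sketch of Concatenated ReLU. It assumes the course's (units, examples) layout and therefore concatenates along axis 0; the function name crelu and the random input are illustrative only.

```python
import numpy as np

def crelu(Z, axis=0):
    # Concatenate ReLU(Z) and ReLU(-Z) along the unit axis,
    # so a layer with n units produces 2 * n outputs.
    return np.concatenate([np.maximum(Z, 0.0), np.maximum(-Z, 0.0)], axis=axis)

# 16 hidden units, 5 examples (illustrative values).
Z = np.random.default_rng(0).normal(size=(16, 5))
A = crelu(Z)
print(Z.shape, "->", A.shape)
```

So the "16" rows for Concatenated ReLU in the table feed 32 values into the output layer, and the "64" rows feed 128.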

While my results above show that ReLU can be comparable, I wondered why ReLU needed so many more neurons. After a few checks, I decided to run the following experiments, and it turns out that the (modified) ReLU can be equally good with only 4 neurons, although this result raises more interesting questions and calls for further experiments.

Cheers,
Raymond

Setting:
learning rate = 0.1
weight initializer = tf.keras.initializers.RandomNormal(mean=0.0, stddev= 2.) # <== Note this change
number of iterations = 10000
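One direct consequence of the changed stddev is the scale of the hidden pre-activations at initialization. The NumPy sketch below only illustrates that scaling; it uses random stand-in data with 2 input features and zero biases, both of which are assumptions, and it does not by itself explain the accuracy difference.

```python
import numpy as np

def hidden_preactivations(X, n_h, stddev, seed=0):
    # Z1 = W1 @ X + b1 with W1 ~ Normal(0, stddev) and zero biases (assumed).
    rng = np.random.default_rng(seed)
    W1 = rng.normal(0.0, stddev, size=(n_h, X.shape[0]))
    b1 = np.zeros((n_h, 1))
    return W1 @ X + b1

# Hypothetical stand-in data: 2 features, 400 examples.
X = np.random.default_rng(1).normal(size=(2, 400))

for stddev in (0.01, 2.0):
    Z1 = hidden_preactivations(X, n_h=4, stddev=stddev)
    print(f"stddev={stddev}: mean |Z1| = {np.abs(Z1).mean():.4f}")
```

With the same seed, the pre-activations simply scale with stddev, so the initial hidden outputs (and the gradients that flow through them) are about 200 times larger with stddev = 2. than with stddev = 0.01.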

Hi @rmwkwok , very interesting! I’ve been following this thread and have learned a lot!

You mentioned that you built the model in Keras. Could you share the summary of your NN? Are these just Dense layers with ReLU/Concatenated ReLU activations?

I have updated my previous post with the summary. And yes, they are just Dense layers with ReLU/CReLU/Modified ReLU. I tried to stick with the same architecture as used in the assignment.