Hey @Elvis_Lok,
Some interesting questions indeed, as Tom rightly said. I have tried to explore your first query a little bit.
Check out the 9th version of this kernel. In it, I trained two neural networks, model1 and model2, with exactly the same architecture; they differ only in the activation functions used in the hidden layers. It looks like model1, which uses ReLU throughout, outperforms model2, which uses a combination of ReLU, Sigmoid and Tanh. However, is this proof that this always happens? Definitely not!
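To make the setup concrete, here is a minimal sketch of how two such models could be defined, assuming a simple dense Keras network; the input size, layer widths and output layer below are placeholders of mine, not the actual kernel code.

```python
# A minimal sketch, not the actual kernel code: two dense networks that are
# identical except for the activations in the hidden layers.
from tensorflow import keras
from tensorflow.keras import layers

def build_model(hidden_activations):
    """Same architecture every time; only the hidden activations change."""
    return keras.Sequential([
        keras.Input(shape=(20,)),                     # assumed input size
        layers.Dense(64, activation=hidden_activations[0]),
        layers.Dense(32, activation=hidden_activations[1]),
        layers.Dense(16, activation=hidden_activations[2]),
        layers.Dense(1, activation="sigmoid"),        # same output layer for both
    ])

model1 = build_model(["relu", "relu", "relu"])        # ReLU throughout
model2 = build_model(["relu", "sigmoid", "tanh"])     # mixed activations
```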
I have run only a single experiment, with a single dataset, a single model architecture, a single loss function, a single optimizer, a single set of hyper-parameter values, and many other "single" things. Moreover, as Prof Andrew discussed in the lecture videos, ReLU is known to perform well on a wide range of tasks, so using ReLU consistently might have an inherent advantage as well. To get solid evidence, you would need to run this experiment a considerable number of times with different combinations of the things mentioned above, ideally using activation functions which are comparable to each other, so that they don't have an inherent advantage over each other (a rough sketch of such a repeated comparison appears below).
But you might wonder whether finding such activation functions, "which are comparable to each other, so that they don't have an inherent advantage over each other", is really possible. I wonder the same, since it is precisely because of these inherent and other task-dependent advantages that we prefer one activation function over another. So you will have to try to find and group related activation functions on which you can perform this experiment.
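Just to make the "run it many times" idea concrete, here is a rough sketch of repeating the comparison over several random seeds. The dataset arrays, layer widths, number of runs and training settings are placeholders I made up, not anything from the kernel.

```python
# A rough sketch of repeating the comparison over several random seeds, so that
# a single lucky initialization does not decide the outcome. X_train, y_train,
# X_val and y_val are placeholder arrays, not from the actual kernel.
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

def build_model(hidden_activations, input_dim):
    # Identical architecture every time; only the hidden activations vary.
    return keras.Sequential(
        [keras.Input(shape=(input_dim,))]
        + [layers.Dense(units, activation=act)
           for units, act in zip([64, 32, 16], hidden_activations)]
        + [layers.Dense(1, activation="sigmoid")]
    )

def average_accuracy(hidden_activations, X_train, y_train, X_val, y_val, n_runs=5):
    scores = []
    for seed in range(n_runs):
        tf.random.set_seed(seed)                      # different init per run
        model = build_model(hidden_activations, X_train.shape[1])
        model.compile(optimizer="adam", loss="binary_crossentropy",
                      metrics=["accuracy"])
        model.fit(X_train, y_train, epochs=20, verbose=0)
        scores.append(model.evaluate(X_val, y_val, verbose=0)[1])
    return np.mean(scores), np.std(scores)

# Compare, e.g., average_accuracy(["relu"] * 3, ...) against
# average_accuracy(["relu", "sigmoid", "tanh"], ...), and then repeat the whole
# thing with different datasets, optimizers and hyper-parameters before
# drawing any conclusions.
```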
Well, that's it for the first query. As for the second one, we leave it up to you to implement it and share your results with the community.
Cheers,
Elemento