Using different activation functions in the hidden layer?

What if I used different activation functions for different hidden layers (e.g. sigmoid for layer 1 and ReLU for layer 2)? What would happen to the model? Would its performance improve?

A similar question: can I use different activation functions for different units within one layer? (e.g. sigmoid for the 1st unit of layer 1, ReLU for the 2nd unit of layer 1)
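Concretely, both ideas can be written down as a forward pass. Here is a minimal NumPy sketch (the layer sizes, weights, and the 2-unit split are made-up illustrations, not from any particular model):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(0.0, z)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))                 # batch of 4 examples, 3 features

W1, b1 = rng.normal(size=(3, 5)), np.zeros(5)
W2, b2 = rng.normal(size=(5, 2)), np.zeros(2)

# Idea 1: a different activation per layer (sigmoid in layer 1, ReLU in layer 2)
a1 = sigmoid(x @ W1 + b1)
a2 = relu(a1 @ W2 + b2)

# Idea 2: different activations for different units within one layer:
# sigmoid on the first 2 units of layer 1, ReLU on the remaining 3
z1 = x @ W1 + b1
a1_mixed = np.concatenate([sigmoid(z1[:, :2]), relu(z1[:, 2:])], axis=1)
```

Nothing stops either variant mathematically; the question is purely whether it helps in practice.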

These are interesting questions that you can explore.

Please post back your results for discussion.

Hey @Elvis_Lok,
Some interesting questions indeed, as Tom rightly said. I have explored your first query a little.

Check out the 9th version of this kernel. In it, I trained two neural networks, model1 and model2, with exactly the same architecture; they differ only in the activation functions used in the hidden layers.
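The setup can be sketched like this in tf.keras. To be clear, the layer sizes, input shape, and loss below are placeholder assumptions for illustration, not the actual architecture from the kernel; the only point is that the two models are identical except for the hidden-layer activations:

```python
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

def build_model(activations):
    """Build a fixed architecture; only the hidden activations vary."""
    model = keras.Sequential([
        keras.Input(shape=(20,)),                       # assumed input size
        layers.Dense(64, activation=activations[0]),
        layers.Dense(64, activation=activations[1]),
        layers.Dense(64, activation=activations[2]),
        layers.Dense(1, activation="sigmoid"),          # assumed binary task
    ])
    model.compile(optimizer="adam",
                  loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model

model1 = build_model(["relu", "relu", "relu"])        # ReLU throughout
model2 = build_model(["relu", "sigmoid", "tanh"])     # mixed activations
```

Training both on the same data (same epochs, batch size, and seed as far as possible) and comparing validation metrics is then a one-to-one comparison of the activation choices.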

It looks like model1, which consistently uses ReLU, outperforms model2, which uses a combination of ReLU, sigmoid and tanh. However, is this proof that this always happens? Definitely not!

I have run only a single experiment: a single dataset, a single model architecture, a single loss function, a single optimizer, a single set of hyper-parameter values, and many other "single" things :joy: Moreover, as Prof Andrew discussed in the lecture videos, ReLU is known to perform well on a wide range of tasks, so using ReLU consistently might have an inherent advantage here. To get solid evidence, you would need to repeat this experiment a considerable number of times with different combinations of the factors above, and to compare activation functions which are comparable to each other, so that none has an inherent advantage over the others.

But you might wonder if this is really possible :thinking:

> activation functions which are comparable to each other, so that they don't have an inherent advantage over each other

I wonder the same, since it is precisely because of these inherent and other task-dependent advantages that we prefer one activation function over another. So you would have to find and group related activation functions on which to run this experiment.

Well, that's it for the first query. As for the second, we leave it up to you to implement, and to share your results with the community.
