Hey @Elvis_Lok,
Some interesting questions indeed, as Tom rightly said. I have tried to explore your first query a little bit.
Check out the 9th version of this kernel. In it, I trained two neural networks, model1 and model2, with exactly the same architecture; they differ only in the activation functions used in the hidden layers. It looks like model1, which uses ReLU throughout, outperforms model2, which uses a combination of ReLU, Sigmoid and Tanh. However, is this proof that this always happens? Definitely not!
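To make the setup concrete, here is a minimal sketch of how two such models could be defined, assuming a simple dense Keras network; the input size, layer widths and output layer below are placeholders of mine, not the actual kernel code.

```python
# A minimal sketch, not the actual kernel code: two dense networks that are
# identical except for the activations in the hidden layers.
from tensorflow import keras
from tensorflow.keras import layers

def build_model(hidden_activations):
    """Same architecture every time; only the hidden activations change."""
    return keras.Sequential([
        keras.Input(shape=(20,)),                     # assumed input size
        layers.Dense(64, activation=hidden_activations[0]),
        layers.Dense(32, activation=hidden_activations[1]),
        layers.Dense(16, activation=hidden_activations[2]),
        layers.Dense(1, activation="sigmoid"),        # same output layer for both
    ])

model1 = build_model(["relu", "relu", "relu"])        # ReLU throughout
model2 = build_model(["relu", "sigmoid", "tanh"])     # mixed activations
```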
I have run only a single experiment, with a single dataset, a single model architecture, a single loss function, a single optimizer, a single set of hyper-parameter values, and many other "single" things. Moreover, as Prof Andrew discussed in the lecture videos, ReLU is known to perform well on a wide range of tasks, so using ReLU consistently might have an inherent advantage as well. To get solid evidence, you would need to run this experiment a considerable number of times with different combinations of the things mentioned above, ideally using activation functions which are comparable to each other, so that they don't have an inherent advantage over each other (a rough sketch of such a repeated comparison appears below).
But you might wonder whether finding such activation functions, "which are comparable to each other, so that they don't have an inherent advantage over each other", is really possible. I wonder the same, since it is precisely because of these inherent and other task-dependent advantages that we prefer one activation function over another. So you will have to try to find and group related activation functions on which you can perform this experiment.
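Just to make the "run it many times" idea concrete, here is a rough sketch of repeating the comparison over several random seeds. The dataset arrays, layer widths, number of runs and training settings are placeholders I made up, not anything from the kernel.

```python
# A rough sketch of repeating the comparison over several random seeds, so that
# a single lucky initialization does not decide the outcome. X_train, y_train,
# X_val and y_val are placeholder arrays, not from the actual kernel.
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

def build_model(hidden_activations, input_dim):
    # Identical architecture every time; only the hidden activations vary.
    return keras.Sequential(
        [keras.Input(shape=(input_dim,))]
        + [layers.Dense(units, activation=act)
           for units, act in zip([64, 32, 16], hidden_activations)]
        + [layers.Dense(1, activation="sigmoid")]
    )

def average_accuracy(hidden_activations, X_train, y_train, X_val, y_val, n_runs=5):
    scores = []
    for seed in range(n_runs):
        tf.random.set_seed(seed)                      # different init per run
        model = build_model(hidden_activations, X_train.shape[1])
        model.compile(optimizer="adam", loss="binary_crossentropy",
                      metrics=["accuracy"])
        model.fit(X_train, y_train, epochs=20, verbose=0)
        scores.append(model.evaluate(X_val, y_val, verbose=0)[1])
    return np.mean(scores), np.std(scores)

# Compare, e.g., average_accuracy(["relu"] * 3, ...) against
# average_accuracy(["relu", "sigmoid", "tanh"], ...), and then repeat the whole
# thing with different datasets, optimizers and hyper-parameters before
# drawing any conclusions.
```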
Well, that's it for the first query. As for the second one, we leave it up to you to implement it and share your results with the community.
Cheers,
Elemento