Building an intuition for which activation function makes sense for a problem

Mah_Neh · August 1, 2022, 5:31pm

Hello everyone,

I am going through DL course 1. Because of some previous background, the math and algorithms make sense to me, but there is a practical, very important concept that I can’t easily wrap my head around.

How do you know which activation function makes sense for a problem? I have read that, in the end, a multilayer NN could probably approximate any data reasonably well, but some choices could take longer than others.

What I am interested in, though, is the case (single-layer or multilayer) where you use some mathematical intuition to decide which function to use.

In a broad sense, it seems that people would use a linear activation function for a continuous-valued output (prices, weights, coordinates) without caring whether the data is line-like or not (see previous paragraph).

They would use a sigmoid function for a binary or maybe multiclass classification problem, because restricting the output to the range 0 to 1 is convenient, etc.

Is there any video, page, blog, book, resource (maybe your own experience) you would recommend to see some examples, and how more knowledgeable people think about these functions, that is not too complicated for a beginner?

I am aware that practice will show me a lot of tricks for it, but I’d like to see what a more educated person says about it.

Thanks.

paulinpaloalto · August 1, 2022, 6:07pm

There are two separate issues here:

Which activation function to use at the output layer
Which activation function(s) to use in the hidden layers of the network

For the output layer, the choice is determined by what your network is predicting. If it is a classification problem, then you use sigmoid for binary classifications (cat/not a cat) and softmax for multiclass classifications (cat, dog, zebra, horse, kangaroo …). Also note that there is a loss function that goes naturally with sigmoid and softmax, which is the cross entropy (“log loss”) loss function. You can think of softmax as the multiclass generalization of sigmoid.
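To make the "softmax generalizes sigmoid" point concrete, here is a minimal sketch in plain Python (standard library only). The function names and the value of `z` are made up for illustration and are not from the course. It shows that a two-class softmax over the scores `[z, 0]` gives the same probability as `sigmoid(z)`, and that the two cross entropy losses then agree as well.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def softmax(logits):
    """Turn a list of real-valued scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def binary_cross_entropy(y_true, p):
    """Log loss for one example; y_true is 0 or 1, p is the predicted P(y=1)."""
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))

def categorical_cross_entropy(true_index, probs):
    """Log loss for one example whose correct class is probs[true_index]."""
    return -math.log(probs[true_index])

z = 1.7  # hypothetical pre-activation value of the output unit

p_sigmoid = sigmoid(z)            # binary case: P(cat)
p_softmax = softmax([z, 0.0])[0]  # two-class softmax: P(class 0)

loss_binary = binary_cross_entropy(1, p_sigmoid)
loss_multi = categorical_cross_entropy(0, softmax([z, 0.0]))

print(p_sigmoid, p_softmax)     # the two probabilities match
print(loss_binary, loss_multi)  # and so do the two losses
```

With more than two classes, the same `softmax` call just takes a longer list of scores, one per class.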

For “regression” problems where you are predicting a continuous numeric value (stock price, temperature, …), you’re right that it might make sense to just use the linear output, or ReLU in the case that a negative output value does not make sense. In that type of problem you want a distance-based loss function, so typically that would be either MSE (mean squared error) or perhaps MAE (mean absolute error).
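As a rough illustration of the regression case, the sketch below (plain Python, with made-up numbers and hypothetical function names) applies a linear output and a ReLU output to the same raw values and scores each with MSE and MAE. It assumes the targets are quantities that cannot be negative, which is the situation where ReLU at the output can help.

```python
def linear(z):
    """Linear (identity) output: pass the value through unchanged."""
    return z

def relu(z):
    """ReLU output: clip negative values to zero."""
    return max(0.0, z)

def mse(y_true, y_pred):
    """Mean squared error over a list of examples."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def mae(y_true, y_pred):
    """Mean absolute error over a list of examples."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical raw values from the last layer and the true targets.
raw_outputs = [-2.0, 0.5, 3.0]
targets = [0.0, 1.0, 2.0]  # assumed non-negative quantities

linear_preds = [linear(z) for z in raw_outputs]
relu_preds = [relu(z) for z in raw_outputs]

print("linear:", mse(targets, linear_preds), mae(targets, linear_preds))
print("relu:  ", mse(targets, relu_preds), mae(targets, relu_preds))
```

In this toy case the ReLU output scores better only because the impossible negative prediction is clipped to zero. If negative targets are legitimate (for example, temperature in Celsius), the linear output is the one to use.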

For the hidden layers of the network, you have a lot more freedom. Here’s a thread which discusses that.

Mah_Neh · August 3, 2022, 3:46pm

I didn’t have time to read it until now. It is very clear, thank you.
