Building an intuition for which activation function makes sense for a problem

Hello everyone,

I am going through DL course 1. Thanks to some previous background, I am comfortable with the math and the algorithms, but there is a practical, very important concept that I can’t easily wrap my head around.

How do you know which activation function makes sense for a problem? I have read that, in the end, a multilayer NN could probably approximate any data reasonably well, though with some activation functions it could take longer.

Still, I am interested in the case of either a single layer or a multilayer network where you try to use some mathematical intuition to decide which function to use.

In a broad sense, it seems that people would use a linear activation function for a continuous-valued output (prices, weights, coordinates), without caring whether the data is line-like or not (see the previous paragraph).

They would use a sigmoid function for a binary or maybe multiclass classification problem, for the convenience of restricting the output to the 0-1 range, etc.

Is there any video, page, blog, book, resource (maybe your own experience) you would recommend to see some examples, and how more knowledgeable people think about these functions, that is not too complicated for a beginner?

I am aware that practice will show me a lot of tricks for it, but I’d like to see what a more educated person says about it.


There are two separate issues here:

  1. Which activation function to use at the output layer
  2. Which activation function(s) to use in the hidden layers of the network

For the output layer, the choice is determined by what your network is predicting. If it is a classification problem, then you use sigmoid for binary classifications (cat/not a cat) and softmax for multiclass classifications (cat, dog, zebra, horse, kangaroo …). Also note that there is a loss function that goes naturally with sigmoid and softmax, which is the cross entropy (“log loss”) loss function. You can think of softmax as the multiclass generalization of sigmoid.
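To make the "softmax generalizes sigmoid" point concrete, here is a minimal sketch in plain Python (standard library only). The function names and the example logit are my own choices for illustration, not anything from the course. It shows that a two-class softmax over the logits [z, 0] gives the same probability as sigmoid(z), and that the two cross entropy losses then agree as well.

```python
import math

def sigmoid(z):
    """Squash a real-valued logit into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def softmax(logits):
    """Turn a list of logits into probabilities that sum to 1."""
    m = max(logits)  # subtract the max before exponentiating, for stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def binary_cross_entropy(y_true, p):
    """Log loss for one example; y_true is 0 or 1, p is the predicted P(y = 1)."""
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))

def categorical_cross_entropy(true_index, probs):
    """Log loss for one example; true_index is the position of the correct class."""
    return -math.log(probs[true_index])

z = 1.5  # hypothetical logit for the "cat" class
p_sigmoid = sigmoid(z)
two_class_probs = softmax([z, 0.0])  # "cat" vs "not a cat"
p_softmax = two_class_probs[0]

loss_binary = binary_cross_entropy(1, p_sigmoid)
loss_categorical = categorical_cross_entropy(0, two_class_probs)

print(p_sigmoid, p_softmax)
print(loss_binary, loss_categorical)
```

The two printed probabilities should match up to floating point rounding, and so should the two losses. With more than two classes you just pass a longer list of logits to softmax.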

For “regression” problems where you are predicting a continuous numeric value (stock price, temperature, …), you’re right that it might make sense to just use the linear output, or ReLU in the case that a negative output value does not make sense. In that type of problem you want a distance based loss function, so typically that would be either MSE (mean squared error) or perhaps MAE (mean absolute error).
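As a rough illustration of the regression case, here is a small plain Python sketch. The raw output values and the targets are made up for the example. It applies a linear (identity) output and a ReLU output to the same raw values, then scores both against the targets with MSE and MAE.

```python
def linear(z):
    """Identity output: the raw value is passed through unchanged."""
    return z

def relu(z):
    """Clamp negative values to zero; non-negative values pass through."""
    return max(0.0, z)

def mse(y_true, y_pred):
    """Mean squared error: average of the squared differences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def mae(y_true, y_pred):
    """Mean absolute error: average of the absolute differences."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical raw values from the last layer, and non-negative targets
raw_outputs = [-1.2, 0.5, 3.0]
targets = [0.0, 1.0, 2.5]

linear_preds = [linear(v) for v in raw_outputs]
relu_preds = [relu(v) for v in raw_outputs]

print("linear:", linear_preds, mse(targets, linear_preds), mae(targets, linear_preds))
print("relu:  ", relu_preds, mse(targets, relu_preds), mae(targets, relu_preds))
```

In this made-up case the targets are never negative, so the ReLU output cannot produce the impossible -1.2 prediction and both losses come out lower for it. If negative targets are possible, the linear output is the one to use.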

For the hidden layers of the network, you have a lot more freedom. Here’s a thread which discusses that.

I didn’t have time to read it until now. It is very clear, thank you.