Cost Function when output layer has activation other than sigmoid

Hi there,

Prof Ng mentioned in one of his videos that you could use an activation function for the output layer other than Sigmoid. However, all subsequent calculations of Back Prop were based on having a Sigmoid activation function and its cost function (as defined for LR).

Is there a reference or some guidance on how to choose a Cost Function when the activation function in the output layer is not a Sigmoid, say ReLU (as suggested in one example by Prof Ng)?

I’m just trying to get a generalized broad understanding on where we can tweak and play with the algorithm.


Hi @Anas_Al_Zabiby, this is an interesting question, and here’s a thread that you can go through to get an idea of why or why not to use ReLU as an activation function for the output layer.
Also, @paulinpaloalto sir and other mentors could suggest other practical approaches to this question. Thanks!

Hello, @Anas_Al_Zabiby. In terms of mere feasibility, a wide range of activation functions could be applied in the output layer. But the output activation should be chosen according to the task at hand.

In (logistic) binary classification, we need an output that can be interpreted as a probability, i.e. a number between zero and one, to decide if an example (e.g. a digitized image) is one of a particular class (e.g. a cat), or not. In terms of probability theory, the sigmoid (i.e. logistic) activation function is a proper “cumulative distribution function” which satisfies the requirements (i.e. axioms) of a probability measure. In this case, ReLU and tanh, for example, are non-starters: ReLU is unbounded above, and tanh takes values between -1 and 1, so neither output can be read as a probability.

An alternative to the logistic regression model could, for example, be the “probit model”, where the cumulative distribution function is that of the Normal distribution. The shape of the probit function closely mimics that of the logistic, but it is (way) more burdensome from a computational perspective. Hence the popularity of the logistic. The main point is that the output activation must be appropriate to the task.
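To make the logistic vs. probit comparison concrete, here is a minimal sketch using only the Python standard library. The function names are mine, and the scale factor 1.702 is a commonly quoted constant for lining up the two curves, so treat it as an assumption rather than part of either model's definition.

```python
import math

def logistic(z):
    """Sigmoid: the CDF of the standard logistic distribution."""
    return 1.0 / (1.0 + math.exp(-z))

def probit(z):
    """CDF of the standard Normal distribution, via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Assumed rescaling constant so that logistic(SCALE * z) roughly tracks probit(z).
SCALE = 1.702

for z in (-2.0, -1.0, 0.0, 1.0, 2.0):
    gap = abs(logistic(SCALE * z) - probit(z))
    print(f"z={z:+.1f}  logistic={logistic(SCALE * z):.4f}  "
          f"probit={probit(z):.4f}  gap={gap:.4f}")
```

Both functions map the real line into (0, 1) and pass through 0.5 at z = 0. The practical difference is that the probit needs an error-function evaluation, while the logistic needs a single exponential.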

For experimental purposes, just about anything that “squashes” the real line into the range [0, 1] is worth a try. These include the “Heaviside function” and a truncated linear regression model where all z values (linear activations) below zero are set to zero, and those above one are set to one. Good hunting!
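The two “squashing” candidates just mentioned can be sketched in a few lines of NumPy. The function names and the convention that the Heaviside step returns 1 at exactly z = 0 are my own assumptions for illustration.

```python
import numpy as np

def heaviside(z):
    """Step function: 0 for z < 0, 1 for z >= 0 (value at 0 is a convention)."""
    return (z >= 0).astype(float)

def truncated_linear(z):
    """Identity on [0, 1]; values below 0 become 0, values above 1 become 1."""
    return np.clip(z, 0.0, 1.0)

z = np.array([-1.5, 0.0, 0.3, 1.0, 2.5])
print(heaviside(z))
print(truncated_linear(z))
```

One caveat if you try these: the Heaviside step has zero slope everywhere except at the jump, and the truncated linear function has zero slope outside [0, 1], so gradient descent gets no signal from examples that land in those flat regions.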


@kenb sir, good explanation! Thanks!

As Ken said in his great explanation, the choice of activation at the output layer fundamentally depends on what the output value means and what you are trying to accomplish with your neural network. If it’s a binary (“yes/no”) classification problem, then Ken explained why ReLU would not be a useful choice.

The case that Prof Ng was probably referring to, where ReLU could be used at the output layer, is when the purpose of your network is a “regression problem”: that is, a problem in which the output is a continuous real number that predicts some other kind of value, like a house price or the air temperature or the amount of precipitation. It is perfectly possible to use neural networks to predict other kinds of outputs besides “yes/no” classifications.

If you consider the case of trying to predict a stock price, for example, then you could use ReLU as the output activation and then use a “distance based” loss function such as MAE (Mean Absolute Error) or MSE (Mean Squared Error). The point is that you want to choose an appropriate measure of how good or bad your prediction is. If you’re trying to predict a non-negative number that could be anywhere between 0 and infinity, then the combination of ReLU plus MSE would probably be a good choice. Of course, you would need to go through the calculations for both forward and backward propagation to adjust for the new functions. MSE is generally preferred over MAE because its derivatives are better behaved and its quadratic behavior punishes large errors more severely than small ones.
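To show what “going through the calculations” might look like for ReLU plus MSE, here is a minimal NumPy sketch for a single output layer. It is not the course's code: the function names, the shapes (features in rows, examples in columns), and the 1/(2m) scaling of the cost are all assumptions I picked for illustration.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def forward(W, b, X):
    """Linear step followed by a ReLU output activation."""
    Z = W @ X + b
    return Z, relu(Z)

def mse_cost(A, Y):
    """Mean squared error; the 1/(2m) factor is an assumed convention."""
    m = Y.shape[1]
    return np.sum((A - Y) ** 2) / (2 * m)

def backward(Z, A, X, Y):
    """Gradients of mse_cost with respect to W and b."""
    m = Y.shape[1]
    dA = (A - Y) / m        # derivative of the cost w.r.t. the output A
    dZ = dA * (Z > 0)       # chain rule through ReLU: slope 1 if Z > 0, else 0
    dW = dZ @ X.T
    db = np.sum(dZ, axis=1, keepdims=True)
    return dW, db

# Tiny made-up regression problem with non-negative targets.
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 5))           # 3 features, 5 examples
Y = np.abs(rng.normal(size=(1, 5)))   # targets in [0, infinity)
W = rng.normal(size=(1, 3))
b = np.zeros((1, 1))

Z, A = forward(W, b, X)
dW, db = backward(Z, A, X, Y)
print("cost:", mse_cost(A, Y))
print("dW:", dW)
```

Compare this with the sigmoid plus cross-entropy case, where the output-layer gradient simplifies to dZ = A - Y. Here the factor (Z > 0) from the ReLU derivative stays in the formula, which is the kind of change you have to propagate when you swap the output activation and cost.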


@paulinpaloalto sir, thank you to you too! Very well explained!

Hi @paulinpaloalto , this makes a lot of sense. It’s exactly what I wanted to know.

I understand the links between gradients and the cost function, and it bothered me that we tweaked the hidden layers and argued for using different activations there, while the output layer stayed as is. The output layer is a bit trickier, especially since it’s linked to a cost function that also needs to change if the output activation produces anything other than a ‘yes/no’ classification.

That said, thanks for providing your input about the MSE cost with ReLU.

@kenb thanks a lot for the detailed explanation. I agree, the output activation must be appropriate to the task, and surely some tasks demand understanding how the output activation (and its cost, if the output is outside [0,1]) can change to fit the question asked.