Cost Function when output layer has activation other than sigmoid

Anas_Al_Zabiby · March 23, 2022, 6:01am

Hi there,

Prof Ng mentioned in one of his videos that you could use an activation function for the output layer other than Sigmoid. However, all subsequent calculations of Back Prop were based on having a Sigmoid activation function and its cost function (as defined for LR).

Is there a reference or some guidance on how to choose a cost function when the activation function in the output layer is not a sigmoid, say ReLU (as suggested in one example by Prof Ng)?

I’m just trying to get a generalized broad understanding on where we can tweak and play with the algorithm.

Thanks,

Rashmi · March 23, 2022, 8:09am

Hi @Anas_Al_Zabiby, this is an interesting query, and here’s a thread that you can go through to get an idea of why or why not to use ReLU as an activation function for the output layer.
Also, @paulinpaloalto sir and other mentors could suggest other practical approaches to this query. Thanks!

kenb · March 23, 2022, 2:58pm

Hello, @Anas_Al_Zabiby. In terms of mere feasibility, a wide range of activation functions could be applied in the output layer. But the output activation should be chosen according to the task at hand.

In (logistic) binary classification, we need an output that can be interpreted as a probability, i.e. a number between zero and one, to decide if an example (e.g. a digitized image) is one of a particular class (e.g. a cat), or not. In terms of probability theory, the sigmoid (i.e. logistic) activation function is a proper “cumulative distribution function” which satisfies the requirement (i.e. axioms) of a probability measure. In this case, ReLU and tanh, for example, are non-starters.

An alternative to the logistic regression model could, for example, be the “probit model”, where the cumulative distribution function is that of the Normal distribution. The shape of the probit function closely mimics that of the logistic, but it is (way) more burdensome from a computational perspective. Hence the popularity of the logistic. The main point is that the output activation must be appropriate to the task.
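To make the comparison concrete, here is a minimal sketch that evaluates both functions at a few points. The function names `logistic` and `probit` are my own labels, and the probit is written with the standard library's error function, which is only one of several ways to compute the Normal CDF.

```python
import math

def logistic(z):
    """Logistic (sigmoid) CDF: 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def probit(z):
    """Standard Normal CDF, expressed through the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Both map any real z into (0, 1), equal 0.5 at z = 0, and increase with z.
for z in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(f"z={z:+.1f}  logistic={logistic(z):.4f}  probit={probit(z):.4f}")
```

Either one yields a value that can be read as a probability, which is the property the binary classification task needs.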

For experimental purposes, just about anything that “squashes” the real line into a [0, 1] range is worth a try. These include the “Heaviside function” and a truncated linear regression model where all z values (linear activations) below zero are set to zero, and those above one are set to one. Good hunting!
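As a rough sketch of those two experimental candidates (the names `heaviside` and `truncated_linear` are mine, and the value of the step function at exactly zero is a convention I picked):

```python
def heaviside(z):
    """Step function: 0 for z < 0, 1 for z >= 0 (value at 0 is a chosen convention)."""
    return 1.0 if z >= 0 else 0.0

def truncated_linear(z):
    """Clip the linear activation z into the range [0, 1]."""
    return min(1.0, max(0.0, z))

for z in (-2.0, 0.3, 5.0):
    print(z, heaviside(z), truncated_linear(z))
```

One caveat worth keeping in mind when experimenting: both functions are flat over part (or all) of their domain, so their derivative is zero there and gradient descent gets no signal from those regions. They can also output exactly 0 or 1, which the log terms of the usual cross-entropy cost cannot handle.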

Rashmi · March 23, 2022, 3:23pm

@kenb sir, good explanation! Thanks!

paulinpaloalto · March 23, 2022, 3:33pm

As Ken said in his great explanation, the choice of activation at the output layer fundamentally depends on what the output value means and what you are trying to accomplish with your neural network. If it’s a binary (“yes/no”) classification problem, then Ken explained why ReLU would not be a useful choice.

The cases that Prof Ng was probably referring to, where ReLU could be used at the output layer, are those where the purpose of your network is a “regression problem”: that is, a problem in which the output is a continuous real number that predicts some other kind of value like a house price or the air temperature or the amount of precipitation. It is perfectly possible to use neural networks to predict other kinds of outputs besides “yes/no” classifications.

If you consider the case of trying to predict a stock price, for example, then you could use ReLU as the output activation and then use a “distance based” loss function such as MAE (Mean Absolute Error) or MSE (Mean Squared Error). The point is that you want to choose an appropriate measure for how good or bad your prediction is. If you’re trying to predict a positive number that could be anywhere between 0 and ∞, then the combination of ReLU plus MSE would probably be a good choice. Of course you would need to go through the calculations for both forward and backward propagation to adjust for the new functions.

MSE is generally preferred over MAE because the derivatives are better behaved and the quadratic behavior punishes large errors more severely than small ones.
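To illustrate what "going through the calculations" could look like for the output layer alone, here is a minimal sketch in plain Python. The function names, the example numbers, and the 1/(2m) scaling of the cost are all assumptions made for illustration, not the course's notation.

```python
def relu(z):
    """ReLU activation: max(0, z)."""
    return max(0.0, z)

def mse_cost(a, y):
    """J = (1 / (2m)) * sum((a_i - y_i)^2); the 1/2 is a convention that tidies the derivative."""
    m = len(y)
    return sum((ai - yi) ** 2 for ai, yi in zip(a, y)) / (2 * m)

def output_layer_dz(z, y):
    """dJ/dz_i for a ReLU output unit with the MSE cost above.

    Chain rule: dJ/dz = dJ/da * da/dz = ((a - y) / m) * (1 if z > 0 else 0).
    """
    m = len(y)
    return [((relu(zi) - yi) / m) * (1.0 if zi > 0 else 0.0) for zi, yi in zip(z, y)]

z = [2.0, -1.0, 0.5]  # hypothetical linear activations of the output unit
y = [1.5, 0.3, 0.5]   # hypothetical regression targets
a = [relu(zi) for zi in z]

print("cost:", mse_cost(a, y))
print("dz:  ", output_layer_dz(z, y))
```

For comparison, with a sigmoid output and the cross-entropy cost the per-example gradient with respect to z simplifies to a - y. With ReLU plus MSE the extra factor from the ReLU derivative remains, so examples with z <= 0 contribute no gradient. The rest of back propagation through the hidden layers is unchanged, since it only needs this dz as its starting point.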

Rashmi · March 23, 2022, 3:50pm

@paulinpaloalto sir, thank you to you too! Very well explained!

Anas_Al_Zabiby · March 24, 2022, 5:22am

Hi @paulinpaloalto, this makes a lot of sense. It’s exactly what I wanted to know.

I understand the links between gradients and the cost function, and it bothered me that we tweaked the hidden layers and discussed different activations while the output layer stayed as is. The output layer is a bit trickier, especially because it’s linked to a cost function that also needs to change if the output activation produces anything besides a ‘yes/no’ classification.

That said, thanks for providing your input about the MSE cost with ReLU.

Anas_Al_Zabiby · March 24, 2022, 5:31am

@kenb thanks a lot for the elaborate details. I agree, the output activation must be appropriate to the task, and surely, some tasks demand understanding how the output activation (and its cost if output is outside [0,1]) can change to fit the question asked.
