For the sigmoid, we used the negative log-likelihood. Is there a similar approach for deriving good cost functions for tanh and ReLU? Are there convex cost functions for them as well?

Given ReLU’s popularity, I’m especially curious about the most appropriate cost function for ReLU. Would it be just the ordinary squared loss function?

Hi @gaussian. The optional video “Explanation of the logistic regression cost function” from Week 2 shows how the Principle of Maximum Likelihood can be used to derive that cost function from a Bernoulli probability model. That is the natural distribution for a “Bernoulli trial”: an event whose outcome is “success” with probability p and “failure” with probability 1-p. It fits the conditional probability model of an image being a cat, or something else, very well.
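To make that connection concrete, here is a minimal sketch (not taken from the video) that checks numerically that the negative log of the Bernoulli likelihood equals the familiar logistic regression cost. The labels and predicted probabilities are made-up values for illustration, and the function names are my own.

```python
import math

def bernoulli_likelihood(y, p):
    """Probability of label y (0 or 1) under a Bernoulli model with P(y=1) = p."""
    return p ** y * (1 - p) ** (1 - y)

def logistic_cost(y, p):
    """Cross-entropy cost for one example: -[y log p + (1 - y) log(1 - p)]."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

# Hypothetical labels and predicted probabilities, for illustration only.
labels = [1, 0, 1, 1]
probs = [0.9, 0.2, 0.6, 0.75]

# Negative log-likelihood of the whole sample (examples assumed independent).
nll = -sum(math.log(bernoulli_likelihood(y, p)) for y, p in zip(labels, probs))

# Sum of the per-example logistic regression costs.
cost = sum(logistic_cost(y, p) for y, p in zip(labels, probs))

print(math.isclose(nll, cost))
```

The two numbers agree because taking the log of p^y (1-p)^(1-y) gives y log p + (1-y) log(1-p) term by term.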

The key here is that we are modeling probabilities in classification tasks, and so one needs to start with a probability model. The tanh and ReLU functions are not candidates for a probability model. (Why?) In the context of an ordinary linear regression model, where the output is a continuous variable (on the real line) and the errors are Gaussian, the Maximum Likelihood principle leads to a mean squared error (MSE) cost function.
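As a rough illustration of the regression case, the sketch below compares the Gaussian negative log-likelihood with the MSE for two candidate parameter settings. The data points, the two (w, b) pairs, and the fixed noise level sigma are all assumptions chosen for the example. With sigma held fixed, the two costs differ only by an additive constant and a positive scale factor, so they rank parameters identically.

```python
import math

def gaussian_nll(ys, preds, sigma=1.0):
    """Negative log-likelihood of ys under independent N(pred, sigma^2) errors."""
    n = len(ys)
    squared_error = sum((y - p) ** 2 for y, p in zip(ys, preds))
    return 0.5 * n * math.log(2 * math.pi * sigma ** 2) + squared_error / (2 * sigma ** 2)

def mse(ys, preds):
    """Mean squared error."""
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)

# Made-up data, for illustration only.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [0.1, 1.9, 4.2, 5.8]

def predict(w, b):
    """Linear model prediction w * x + b for every x in xs."""
    return [w * x + b for x in xs]

sigma = 1.0
n = len(ys)
preds_a = predict(2.0, 0.0)
preds_b = predict(1.5, 0.5)

# The constant term cancels in the difference, leaving a scaled MSE difference.
nll_gap = gaussian_nll(ys, preds_a, sigma) - gaussian_nll(ys, preds_b, sigma)
mse_gap = mse(ys, preds_a) - mse(ys, preds_b)

print(math.isclose(nll_gap, n / (2 * sigma ** 2) * mse_gap))
```

So minimizing the Gaussian negative log-likelihood over w and b is the same optimization problem as minimizing the MSE.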

As for the second part of your question, the ReLU function could be used as the activation in the simple linear regression model if negative outputs do not make sense (e.g. house prices). That is, the output is

y = \max\lbrace 0, wx+b\rbrace.

In that case, the errors are typically assumed to have a truncated normal distribution. Maximum Likelihood can be applied here to derive an appropriate cost function. Due to the nonlinearity of the model, the resulting expression is probably a bit ugly.
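To give a feel for what such a cost could look like, here is a sketch under one possible reading of the model: the target y is drawn from a normal distribution with location mu = wx+b and scale sigma, truncated below at 0. That reading, the fixed sigma, and the function names are my assumptions, not something established above. The per-example negative log-likelihood is then the usual Gaussian term plus a correction that renormalizes the density over y >= 0.

```python
import math

def std_normal_cdf(z):
    """Standard normal CDF, written with the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def truncated_normal_nll(y, mu, sigma=1.0):
    """Per-example NLL of y >= 0 under N(mu, sigma^2) truncated below at 0.

    Assumption: mu plays the role of w * x + b. For very negative mu / sigma
    the CDF underflows to 0 and math.log raises an error; a real
    implementation would need a numerically stable log-CDF.
    """
    if y < 0:
        raise ValueError("y must be non-negative under this model")
    gaussian_part = (0.5 * math.log(2 * math.pi * sigma ** 2)
                     + (y - mu) ** 2 / (2 * sigma ** 2))
    # The untruncated normal puts mass Phi(mu / sigma) on y >= 0, so the
    # truncated density divides by it; in the NLL that adds log Phi(mu / sigma).
    return gaussian_part + math.log(std_normal_cdf(mu / sigma))

# Far from the boundary the correction vanishes and we recover the Gaussian term.
far = truncated_normal_nll(10.0, 10.0)
# At mu = 0 half the mass is cut off, so the correction is log(0.5).
at_zero = truncated_normal_nll(0.0, 0.0)
print(far, at_zero)
```

The extra log Phi(mu / sigma) term depends on w and b, which is why the cost no longer reduces to a plain squared error.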