In this course we learn about using a single-value metric to choose hyperparameters. I am confused about the difference between a metric and a loss.
Here is my understanding; please correct me if I’m wrong:

Metric: Calculated on the dev set. Its function is to let us pick the best model/hyperparameters. This is done by trying out different combinations of models/hyperparameters and choosing the combination that optimizes (maximizes) the metric.

Loss: Calculated on the training set, on each mini-batch. Its function is to train the model. This is done by optimizing (minimizing) the loss with optimization methods such as gradient descent.

But why are the two different? Shouldn’t both of them reflect the end goal we want our final model to achieve?

For example, why would it make sense to want a model that optimizes F1 score, but train it using binary cross-entropy as the loss function? I understand that F1 score is not differentiable, so we cannot use it for gradient descent. But I have also seen simple adaptations where we switch y_predict from binary labels to probabilities when calculating F1 score, which makes it differentiable.
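To make the adaptation mentioned above concrete, here is a minimal sketch in plain Python. The function names (f1_from_scores, hard_f1, soft_f1) and the toy labels and probabilities are made up for illustration; they are not from the course. It only shows that thresholded F1 does not react to a small change in a predicted probability, while the "soft" variant does.

```python
def f1_from_scores(y_true, scores):
    """F1 = 2*TP / (2*TP + FP + FN); scores may be 0/1 predictions or probabilities."""
    tp = sum(t * s for t, s in zip(y_true, scores))
    fp = sum((1 - t) * s for t, s in zip(y_true, scores))
    fn = sum(t * (1 - s) for t, s in zip(y_true, scores))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def hard_f1(y_true, y_prob, threshold=0.5):
    """Usual F1: probabilities are thresholded to 0/1 before counting."""
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    return f1_from_scores(y_true, y_pred)


def soft_f1(y_true, y_prob):
    """'Soft' F1: probabilities are plugged in directly, so the value varies smoothly."""
    return f1_from_scores(y_true, y_prob)


# Hypothetical toy data: the third probability is nudged from 0.6 to 0.7.
y_true = [1, 0, 1, 0]
y_prob = [0.9, 0.2, 0.6, 0.4]
nudged = [0.9, 0.2, 0.7, 0.4]

# Hard F1 is identical for both inputs; soft F1 increases with the nudge.
print(hard_f1(y_true, y_prob), hard_f1(y_true, nudged))
print(soft_f1(y_true, y_prob), soft_f1(y_true, nudged))
```

Because hard F1 is flat almost everywhere, its gradient carries no training signal; the soft version is one possible surrogate, not necessarily a better one than cross-entropy.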

An optimization criterion (also called an objective, cost, or loss function) is calculated, as you correctly mentioned, during training to estimate the best parameters for our model. Negative log-likelihood (NLL) and mean squared error (MSE) are examples of loss functions. Both are based on the principle of maximum likelihood estimation (MSE corresponds to the NLL under a Gaussian noise assumption).
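As a rough illustration of how MSE relates to maximum likelihood, the sketch below assumes the targets are Gaussian around the model's prediction with a fixed standard deviation sigma. Under that assumption the average NLL is just the MSE scaled by 1/(2*sigma^2) plus a constant, so both have the same minimizer. The function names and numbers are hypothetical.

```python
import math


def gaussian_nll(y, mu, sigma=1.0):
    """Negative log-density of y under a Normal(mu, sigma^2) distribution."""
    return 0.5 * math.log(2 * math.pi * sigma ** 2) + (y - mu) ** 2 / (2 * sigma ** 2)


def mean_gaussian_nll(ys, mus, sigma=1.0):
    """Average NLL over a set of (target, prediction) pairs."""
    return sum(gaussian_nll(y, mu, sigma) for y, mu in zip(ys, mus)) / len(ys)


def mse(ys, mus):
    """Mean squared error over the same pairs."""
    return sum((y - mu) ** 2 for y, mu in zip(ys, mus)) / len(ys)


# Hypothetical targets and predictions.
ys = [1.0, 2.0, 3.0]
mus = [1.5, 2.0, 2.0]
sigma = 2.0

# The two printed values agree up to floating-point error:
# mean NLL = constant + MSE / (2 * sigma^2).
constant = 0.5 * math.log(2 * math.pi * sigma ** 2)
print(mean_gaussian_nll(ys, mus, sigma))
print(constant + mse(ys, mus) / (2 * sigma ** 2))
```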

We usually use the term loss for a single training example and cost for the average loss over the training examples of a minibatch.

A metric may be basically any function that helps us understand how well our learning algorithm performs. F-score is an example of a metric.

Why a single training example? As you mentioned, mean squared error is a loss function. It is “mean” squared error, not just error: mean as in the average over many samples. So shouldn’t the loss be calculated on a (mini-)batch of training samples?

Yes, I see why it may be a little bit confusing. A function that computes an average loss over training examples of the minibatch is usually called the cost function.
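In code, the naming convention could look like the following sketch (the function names are only illustrative): the squared error is the per-example loss, and MSE is the cost obtained by averaging it over the minibatch.

```python
def squared_error(y, y_hat):
    """Loss for ONE training example."""
    return (y - y_hat) ** 2


def mse_cost(ys, y_hats):
    """Cost for a minibatch: the average of the per-example losses."""
    losses = [squared_error(y, y_hat) for y, y_hat in zip(ys, y_hats)]
    return sum(losses) / len(losses)


# Hypothetical minibatch of three targets and predictions.
ys = [1.0, 2.0, 3.0]
y_hats = [1.0, 2.5, 2.0]

print([squared_error(y, y_hat) for y, y_hat in zip(ys, y_hats)])  # per-example losses
print(mse_cost(ys, y_hats))                                       # minibatch cost
```

In practice many people and libraries use "loss" for both quantities, so the distinction is mostly a matter of terminology.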

I suppose one possible answer to this question would be: just try it on your task.

Minimizing the negative log-likelihood directly minimizes the difference (the KL divergence) between the training data distribution (which we assume is close to the true data distribution) and the model distribution. We want to model the true data distribution, and with negative log-likelihood we are doing exactly that.
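A small numerical check of this claim, for a discrete distribution with made-up probabilities: the expected NLL of the model under the data distribution (the cross-entropy) equals the entropy of the data plus the KL divergence from the data distribution to the model. Since the entropy term does not depend on the model, minimizing NLL and minimizing KL divergence pick the same model. The names p_data and q_model are hypothetical.

```python
import math


def cross_entropy(p, q):
    """Expected negative log-likelihood of model q under data distribution p."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)


def entropy(p):
    """Entropy of p; a constant with respect to the model."""
    return cross_entropy(p, p)


def kl_divergence(p, q):
    """KL(p || q): how far the model q is from the data distribution p."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical empirical distribution over three classes and a model's guess.
p_data = [0.5, 0.3, 0.2]
q_model = [0.4, 0.4, 0.2]

# The two printed values agree up to floating-point error.
print(cross_entropy(p_data, q_model))
print(entropy(p_data) + kl_divergence(p_data, q_model))
```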

I don’t see how a loss function based on F-score may be a better alternative to the negative log-likelihood, but it would definitely be more expensive to compute.

@manifest suppose we are trying to detect a market crash. In this case, our priority wouldn’t be to model the true data distribution, but to correctly detect as many of these events as we can without raising too many false alarms. So might F1 score be preferable to negative log-likelihood?

Our goal is always to model the true data distribution. One way to think about it is that the training set is just a subset of the true, unknown data set. We will do better at detecting a market crash if our model recognizes patterns of the true unknown data set rather than only those of a limited part of it.

But if modeling the true data distribution is always the goal, why isn’t likelihood our metric?

Sorry, I am still confused about why the metric and the loss/cost would be different if both of them are a means of reflecting how well or badly our model is doing on our target task.

Since we can only observe training examples (only a subset of true data) and maximizing log-likelihood only makes our model predictions close to the training set data, we can’t be confident that the model will perform equally well on the true data.

Task-specific metrics (such as F-score for a classification task) make it simple for humans to evaluate results.
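One way the two can work together, sketched under the assumption that a model has already been trained with NLL and outputs probabilities: the metric is then computed on the dev set to choose a hyperparameter, here the decision threshold. The dev labels, probabilities, and candidate thresholds below are invented for illustration.

```python
def f1_at_threshold(y_true, y_prob, threshold):
    """F1 on the dev set after turning probabilities into 0/1 predictions."""
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_threshold(y_true, y_prob, candidates):
    """Pick the candidate threshold with the highest dev-set F1."""
    return max(candidates, key=lambda th: f1_at_threshold(y_true, y_prob, th))


# Hypothetical dev set: rare positive events (1 = crash), model probabilities.
dev_labels = [0, 0, 0, 0, 1, 0, 1, 0]
dev_probs = [0.05, 0.10, 0.20, 0.35, 0.40, 0.25, 0.80, 0.15]
candidates = [0.1, 0.3, 0.5, 0.7]

for th in candidates:
    print(th, f1_at_threshold(dev_labels, dev_probs, th))
print("chosen threshold:", best_threshold(dev_labels, dev_probs, candidates))
```

Here the loss shapes the probabilities during training, and the metric decides how those probabilities are used for the task we actually care about.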

It is possible that some other loss function would work better on your particular task; I suppose you just need to experiment.

You may read more on empirical risk minimization and surrogate loss functions in the 8th chapter of Goodfellow’s Deep Learning book.