Which set should I calculate metrics on?

In the video Single Number Evaluation Metric, Andrew Ng advises us to pick a single metric to compare models. Which set, train or dev, should this metric be calculated on?

Of course, the dev set metric is more important, but some model changes, such as increasing capacity, target underfitting, so they improve dev set performance only by first improving train set performance. So, I am deciding between two strategies.

Strategy 1: In the spirit of Orthogonalization, use a two-phase approach. In the first phase, compare models by the train set metric until we find a model that reaches an acceptable level of performance (e.g. 95% train accuracy). At this point, we can train models with early stopping based on train set loss/accuracy. In the second phase, compare models by the dev set metric until we find a model that reduces the gap between the train set metric and dev set metric sufficiently (e.g. 92% dev accuracy). Now we do early stopping based on dev set loss/accuracy.
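To make the two phases concrete, here is a minimal sketch of how Strategy 1 could be read as a model selection rule. The function name `select_two_phase`, the dict keys, and the 0.95 target are assumptions for illustration only, not something taken from the course.

```python
def select_two_phase(models, train_target=0.95):
    """Pick a model under one reading of Strategy 1.

    models: list of dicts with hypothetical keys 'name', 'train_acc', 'dev_acc'.
    train_target: assumed "acceptable" train accuracy that ends phase 1.
    """
    # Phase 1: check whether any model fits the train set well enough.
    fit_well = [m for m in models if m["train_acc"] >= train_target]
    if not fit_well:
        # Still underfitting, so compare by the train set metric only.
        return max(models, key=lambda m: m["train_acc"])
    # Phase 2: among models that fit the train set, compare by the dev set metric.
    return max(fit_well, key=lambda m: m["dev_acc"])


candidates = [
    {"name": "small", "train_acc": 0.90, "dev_acc": 0.88},
    {"name": "medium", "train_acc": 0.96, "dev_acc": 0.89},
    {"name": "large", "train_acc": 0.97, "dev_acc": 0.91},
]
print(select_two_phase(candidates)["name"])
```

Note that while no model reaches the train target, this rule ignores the dev set metric entirely, which is exactly the part I am unsure about.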

Strategy 2: Always optimise the dev set metric. When training, always use early stopping based on the dev set metric. Train set loss and metrics can still be useful diagnostics to figure out which new models or pipelines to try, but if a new model improves the train set metric while worsening the dev set metric, it should be immediately discarded.
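For reference, this is roughly what I mean by early stopping based on the dev set metric. It is a framework-free sketch: the function name `early_stop` and the `patience` and `min_delta` parameters are my own illustrative choices, and real frameworks provide their own callbacks for this.

```python
def early_stop(dev_losses, patience=3, min_delta=0.0):
    """Return (best_epoch, stop_epoch) for a sequence of per-epoch dev losses.

    Training stops once the dev loss has failed to improve by more than
    min_delta for `patience` consecutive epochs. Train loss is never consulted.
    """
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, loss in enumerate(dev_losses):
        if loss < best_loss - min_delta:
            # New best dev loss: remember it and reset the counter.
            best_loss = loss
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return best_epoch, epoch
    # Ran out of epochs before the patience limit was hit.
    return best_epoch, len(dev_losses) - 1


# Made-up dev losses for illustration only.
history = [0.9, 0.7, 0.6, 0.62, 0.65, 0.63, 0.5]
print(early_stop(history, patience=3))
```

With a small `patience`, training can stop before a later improvement on the dev set is ever seen, so the choice of `patience` matters when the dev loss is noisy.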

I suspect that Strategy 2 is the better one, but I’m training a model at the moment where early stopping based on dev set loss is causing training to stop while train set loss is still decreasing rapidly, and it’s tempting to get the best possible performance on the train set first and worry about the dev set later. What do others think?

Hello @akubi,

I think strategy 2 is the standard approach.

Even if we use strategy 1, my guess is that no matter how well the first phase does, training in the second phase will still stop early, before the training set reaches an acceptable loss. The key question is whether the first phase does anything to remove the bottleneck on dev set performance. If it does not, the first phase of strategy 1 is likely to be redundant, and the two strategies end up being the same. Does that make sense?

The point here should still be how to remove the bottleneck. If I were you, I would probably inspect some poorly predicted samples in the dev set, bring my understanding of the problem to the table, and see if I can think of something. Course 3 Week 2’s error analysis should be relevant here.
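As a rough illustration of that kind of error analysis, the sketch below tallies hand-assigned tags over misclassified dev examples. The function name `error_breakdown`, the dict keys, and the tag names are all hypothetical, and the tags are assumed to come from you inspecting each example manually.

```python
from collections import Counter


def error_breakdown(examples):
    """Fraction of misclassified dev examples carrying each hand-assigned tag.

    examples: list of dicts with hypothetical keys 'label', 'pred', 'tags'.
    An example may carry several tags, so the fractions can sum to more than 1.
    """
    errors = [ex for ex in examples if ex["label"] != ex["pred"]]
    if not errors:
        return {}
    counts = Counter(tag for ex in errors for tag in ex["tags"])
    return {tag: n / len(errors) for tag, n in counts.most_common()}


# Made-up dev examples for illustration only.
dev_examples = [
    {"label": 1, "pred": 0, "tags": ["blurry"]},
    {"label": 0, "pred": 1, "tags": ["blurry", "dark"]},
    {"label": 1, "pred": 1, "tags": []},
    {"label": 0, "pred": 1, "tags": ["mislabeled"]},
    {"label": 1, "pred": 0, "tags": ["dark"]},
]
print(error_breakdown(dev_examples))
```

The tags that account for the largest share of errors suggest where the bottleneck might be and what is worth trying next.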


That makes sense, thank you for the clarification.