Clarifying Which Training Error (Train vs. Eval Mode) to Use for Bias–Variance Decomposition

Hi everyone,

I’m studying Batch Normalization, but I’m having some difficulty understanding how to calculate the bias and variance.

What I’ve learned so far:

  1. Bias ≈ (training error) − (Bayes error), where Bayes error is the irreducible error (e.g. approximated by human performance on the same task).
  2. Variance ≈ (validation error) − (training error).
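
To make the arithmetic concrete, here is a tiny numeric sketch; the error figures below are made up purely for illustration and are not from any real experiment.

```python
# Hypothetical error rates, chosen only to illustrate the decomposition arithmetic.
bayes_error = 0.02   # e.g. estimated from human-level performance
train_error = 0.07   # error measured on the training set
val_error   = 0.11   # error measured on the dev/validation set

avoidable_bias = train_error - bayes_error   # 0.05 -> mostly a bias problem
variance_gap   = val_error - train_error     # 0.04 -> plus some variance

print(f"avoidable bias ≈ {avoidable_bias:.2f}, variance ≈ {variance_gap:.2f}")
```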

The confusion:

  • During training, we run the model in train mode:
    • BatchNorm uses each mini-batch’s mean/variance
    • Dropout is active
    • This gives us the train-mode error that guides weight updates.
  • During evaluation (dev/test), we switch to eval mode:
    • BatchNorm uses the running mean/variance
    • Dropout is disabled
    • This gives us the eval-mode error that reflects real-world inference.
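
As a minimal PyTorch sketch of that mode switch (the toy network below is just for illustration), you can see Dropout re-sampling its mask in train mode and everything becoming deterministic in eval mode:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
net = nn.Sequential(nn.Linear(10, 10), nn.BatchNorm1d(10), nn.ReLU(), nn.Dropout(p=0.5))
x = torch.randn(32, 10)           # one mini-batch

net.train()                       # BatchNorm uses batch stats, Dropout is active
out_a = net(x)
out_b = net(x)                    # differs: Dropout samples a fresh mask each forward pass
                                  # (and BatchNorm's running stats are updated as a side effect)

net.eval()                        # BatchNorm uses running stats, Dropout is a no-op
out_c = net(x)
out_d = net(x)                    # identical: eval-mode forward passes are deterministic

print(torch.equal(out_a, out_b))  # False (with overwhelming probability)
print(torch.equal(out_c, out_d))  # True
```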

So there are two different “training errors”:

  1. Train-Mode Error – computed in model.train() with batch stats & dropout
  2. Train-Eval Error – computed in model.eval() on the training set using running stats & no dropout
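
A minimal sketch of how those two numbers could be measured, assuming a classification model and a `train_loader` yielding (inputs, labels) batches; both names are placeholders I’m introducing for illustration:

```python
import torch

@torch.no_grad()
def error_rate(model, loader, train_mode=False):
    """Fraction of misclassified examples, measured in train() or eval() mode."""
    model.train(train_mode)       # train(True) -> train mode, train(False) -> eval mode
    wrong, total = 0, 0
    for inputs, labels in loader:
        preds = model(inputs).argmax(dim=1)
        wrong += (preds != labels).sum().item()
        total += labels.numel()
    model.eval()                  # leave the model in inference mode afterwards
    return wrong / total

# Measured on the same training data, these two numbers will generally differ.
# Caveat: forward passes in train mode still update BatchNorm's running statistics,
# even under no_grad().
# train_mode_error = error_rate(model, train_loader, train_mode=True)   # batch stats + dropout
# train_eval_error = error_rate(model, train_loader, train_mode=False)  # running stats, no dropout
```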

My question is:

For the bias–variance decomposition, which of these two training errors should be used as “training error”?

  • If I use train-mode error, it includes regularization noise (dropout and the fluctuating BatchNorm batch statistics), so it sits above the error the model’s representational capacity can actually achieve.
  • If I use train-eval error, it mirrors the inference conditions and should lie closer to the model’s best achievable error (i.e. closer to Bayes error).

Intuitively, I think both bias and variance should be measured in eval mode, so that:

  • Bias = train-eval error − Bayes error
  • Variance = validation error − train-eval error

This way, I’m comparing “apples to apples” for both train and dev/test. But I’ve also seen advice that suggests using the train-mode error to gauge variance.

Can anyone explain:

  1. Which of these training errors is correct to use for bias and for variance, and
  2. Why the standard practice chooses one over the other?

Thanks in advance for clarifying this subtle but crucial point!

Yes, this is a perceptive and important question. When we actually evaluate the performance of the model, it is always done in “inference” mode, not training mode, exactly for the reason you mentioned: you don’t want the additional noise from regularization included in the results. The point of regularization is that it affects how the training works and thus gives us different weight values than we’d get without it. But then when we want to see how the resulting model performs, we run it in “plain vanilla” inference mode: just forward propagation without any regularization.

As I’m guessing you already realize, when Prof Ng talks about “training error”, he just means the prediction errors on the training data, but in inference mode. And “test error” or “validation error” just means the prediction errors using the test or validation datasets.
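
So in practice the standard recipe is simply: measure both error figures with the model in eval mode and read bias and variance off those gaps. A self-contained sketch, where `train_loader`/`dev_loader` and the helper name are placeholders of my own:

```python
import torch

@torch.no_grad()
def inference_error(model, loader):
    """Misclassification rate with the model in inference (eval) mode."""
    model.eval()                  # running BatchNorm stats, Dropout disabled
    wrong, total = 0, 0
    for inputs, labels in loader:
        preds = model(inputs).argmax(dim=1)
        wrong += (preds != labels).sum().item()
        total += labels.numel()
    return wrong / total

# Both figures are inference-mode errors; only the dataset changes:
# training_error   = inference_error(model, train_loader)
# validation_error = inference_error(model, dev_loader)
# avoidable_bias   = training_error - bayes_error_estimate
# variance_gap     = validation_error - training_error
```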

One other subtlety perhaps worth mentioning is that the word “variance” has a number of different meanings. When he talks about “bias versus variance”, the variance there means something different from the mean and variance used in Batch Norm. In the Batch Norm case, it’s the usual statistical meaning of variance, which is a well-defined metric.
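
To make the Batch Norm sense of “variance” concrete, here is a small sketch (my own illustration, not tied to any particular lecture) showing that a BatchNorm layer in train mode normalizes each feature by the mini-batch’s ordinary statistical mean and (biased) variance:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(32, 10)                        # one mini-batch, 10 features

bn = nn.BatchNorm1d(10, affine=False)          # affine=False: no learnable scale/shift
bn.train()                                     # train mode: normalize with batch statistics
y = bn(x)

# The same normalization done by hand with the plain statistical mean/variance:
mean = x.mean(dim=0)
var = x.var(dim=0, unbiased=False)             # biased (population) variance of the batch
y_manual = (x - mean) / torch.sqrt(var + bn.eps)

print(torch.allclose(y, y_manual, atol=1e-6))  # True
```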