Bias and variance

Week 3 - Diagnosing Bias and Variance at timestamp 9:20

I have checked Week 3 - Diagnosing Bias and Variance at 9:20 and I can now see a plot illustrating high bias and high variance.

Thanks.

To formally express the bias-variance tradeoff in regression, we construct the following setup:

  • Draw a training dataset {\mathbb D} = \{(x^{(i)}, y^{(i)})\}_{i=1}^{|\mathbb D|}, which contains |\mathbb D| elements such that y^{(i)} = f(x^{(i)}) + \xi^{(i)}, where f(\cdot) is a ground-truth function and \xi^{(i)} \sim {\mathcal N}(0, \sigma^2) is Gaussian noise.
  • A model is trained on {\mathbb D}, denoted by f_{\mathbb D}.
  • For a test example (x, y) such that y = f(x) + \xi, where \xi \sim {\mathcal N}(0, \sigma^2), we measure the expected test error (averaged over the random draw of the training set {\mathbb D} and the randomness of \xi)
{\rm MSE}(x) = {\mathbb E}_{{\mathbb D}, \xi} [(y - f_{\mathbb D}(x))^2].

The test input x is considered to be fixed, but the same setup holds when we average over multiple examples. We can decompose {\rm MSE}(x):

\begin{align} {\rm MSE}(x) &= \mathbb E[(\xi + (f(x) - f_{\mathbb D}(x)))^2] \\ & = \mathbb E[\xi^2] + \mathbb E[(f(x) - f_{\mathbb D}(x))^2] \\ & = \sigma^2 + \mathbb E[(f(x) - f_{\mathbb D}(x))^2] \\ & = \sigma^2 + \mathbb E[(f(x) - \mathbb E[f_{\mathbb D}(x)] + \mathbb E[f_{\mathbb D}(x)] - f_{\mathbb D}(x))^2] \\ & = \sigma^2 + (f(x) - \mathbb E[f_{\mathbb D}(x)])^2 + \mathbb E[(\mathbb E[f_{\mathbb D}(x)] - f_{\mathbb D}(x))^2] \\ & = \underbrace{\sigma^2}_{\rm noise} + \underbrace{(f(x) - \mathbb E[f_{\mathbb D}(x)])^2}_{\rm bias^2} + \underbrace{{\rm Var}(f_{\mathbb D}(x))}_{\rm variance}. \end{align}

(In the second line, the cross term 2\,\mathbb E[\xi\,(f(x) - f_{\mathbb D}(x))] vanishes because \xi has zero mean and is independent of {\mathbb D}.)

Here \mathbb E[f_{\mathbb D} (x)] corresponds to training the model on infinitely many datasets and averaging their predictions at x. The bias quantifies the error introduced by the model’s inability to represent the true function f(x) and reflects the limitations of the model class (“underfitting”). The variance measures the sensitivity of the learned model to the randomness in the dataset.
Since we can’t draw an infinite number of datasets, the bootstrap method allows us to estimate \mathbb E[f_{\mathbb D} (x)] in practice. We resample datasets from the one dataset we have, using sampling with replacement. Then we train multiple models on these bootstrap-resampled datasets and use the average prediction across bootstrap models to approximate bias and variance.
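If it helps, here is a minimal sketch of the idea in Python. It is not the course notebook's code; the ground-truth function, the noise level, and the polynomial model class are all assumptions made purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy setup: ground truth f and noise level sigma.
f = lambda x: np.sin(2 * np.pi * x)
sigma = 0.3

# The one dataset D we actually have, with |D| = n points.
n = 50
x_train = rng.uniform(0, 1, size=n)
y_train = f(x_train) + rng.normal(0, sigma, size=n)

# A stand-in model class: degree-3 polynomial least squares.
def fit_model(x, y, degree=3):
    coeffs = np.polyfit(x, y, degree)
    return lambda x_new: np.polyval(coeffs, x_new)

# Bootstrap: resample n pairs with replacement, refit, and record
# each bootstrap model's prediction at a fixed test input x.
x_test = 0.5
B = 200  # number of bootstrap resamples
preds = np.empty(B)
for b in range(B):
    idx = rng.integers(0, n, size=n)  # sampling with replacement
    model_b = fit_model(x_train[idx], y_train[idx])
    preds[b] = model_b(x_test)

# The bootstrap average approximates E[f_D(x)]. In this simulation we
# know f, so we can also compute the squared bias; with real data f(x)
# is unknown and the bias can only be estimated indirectly.
avg_pred = preds.mean()
bias_sq = (f(x_test) - avg_pred) ** 2
variance = preds.var()
print(f"bias^2 ~ {bias_sq:.4f}, variance ~ {variance:.4f}")
```

Increasing B makes the bootstrap averages less noisy; the resampling with replacement is what substitutes for drawing fresh datasets from the underlying distribution.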

Thanks, but I am unfamiliar with some of your notation, as it is not used in Andrew’s course.

What do you mean by “draw a training set”?

What does the notation |\mathbb D| mean?

It’s not clear to me what the bootstrap method or the bias-variance decomposition is.

When we say “draw a training set”, we mean randomly sample a set of input–output pairs \{(x^{(i)}, y^{(i)})\} from some underlying data-generating distribution p. This constitutes one draw of a training set \mathbb{D} \sim p^{|\mathbb{D}|}, the product distribution over |\mathbb{D}| i.i.d. pairs. The notation |\mathbb{D}| means the size of the dataset \mathbb{D}, i.e. if \mathbb{D} = \{ (x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \dots, (x^{(n)}, y^{(n)}) \}, then |\mathbb{D}| = n.
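As a concrete toy illustration (the specific f, noise level, and dataset size here are assumptions for the example, not anything from the course):

```python
import numpy as np

rng = np.random.default_rng(42)

f = lambda x: np.sin(2 * np.pi * x)  # assumed ground-truth function
sigma = 0.3                          # assumed noise standard deviation
n = 100                              # dataset size, i.e. |D| = n

# One "draw" of a training set D: n i.i.d. pairs from the
# data-generating process y = f(x) + noise.
x = rng.uniform(0, 1, size=n)
y = f(x) + rng.normal(0, sigma, size=n)
D = list(zip(x, y))

print(len(D))  # prints 100, which is |D|
# Re-running with a different seed gives a *different* draw of D.
```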
Let me know if you’d like me to add an implementation of the bootstrap method to the notebook.

What is a “product distribution”?

What does “i.i.d.” mean?

Also, I still don’t know what the bootstrap method or the bias-variance decomposition is, as these terms have not been used by Andrew in his video lessons up to this point.

How do you get from:

{\rm MSE}(x) = \sigma^2 + \mathbb E[(f(x) - \mathbb E[f_{\mathbb D}(x)] + \mathbb E[f_{\mathbb D}(x)] - f_{\mathbb D}(x))^2]

to:

{\rm MSE}(x) = \sigma^2 + (f(x) - \mathbb E[f_{\mathbb D}(x)])^2 + \mathbb E[(\mathbb E[f_{\mathbb D}(x)] - f_{\mathbb D}(x))^2]

A product distribution is a way of describing multiple independent and identically distributed (i.i.d.) samples: each data point is drawn independently from the same distribution, so the distribution of the whole dataset is the product of the distributions of the individual points.

Although the bias-variance decomposition isn’t covered in this course, I think it’s useful extra material. The bootstrap method is simply a practical tool for estimating bias and variance using the data you already have.

I see.

Thank you.

Taking into account that f(x) - \mathbb E[f_{\mathbb D}(x)] is a constant (it does not depend on the draw of {\mathbb D}), we have:

\begin{align} {\rm MSE}(x) = &\ \sigma^2 + \mathbb E[(f(x) - \mathbb E[f_{\mathbb D}(x)] + \mathbb E[f_{\mathbb D}(x)] - f_{\mathbb D}(x))^2] \\ = &\ \sigma^2 + \mathbb E[(f(x) - \mathbb E[f_{\mathbb D}(x)])^2] + \\ & 2\, \mathbb E[(f(x) - \mathbb E[f_{\mathbb D}(x)])(\mathbb E[f_{\mathbb D}(x)] - f_{\mathbb D}(x))] + \mathbb E[(\mathbb E[f_{\mathbb D}(x)] - f_{\mathbb D}(x))^2] \\ = &\ \sigma^2 + (f(x) - \mathbb E[f_{\mathbb D}(x)])^2 + \\ & 2\, (f(x) - \mathbb E[f_{\mathbb D}(x)])\, \mathbb E[\mathbb E[f_{\mathbb D}(x)] - f_{\mathbb D}(x)] + \mathbb E[(\mathbb E[f_{\mathbb D}(x)] - f_{\mathbb D}(x))^2] \\ = &\ \sigma^2 + (f(x) - \mathbb E[f_{\mathbb D}(x)])^2 + \\ & 2\, (f(x) - \mathbb E[f_{\mathbb D}(x)])(\mathbb E[f_{\mathbb D}(x)] - \mathbb E[f_{\mathbb D}(x)]) + \mathbb E[(\mathbb E[f_{\mathbb D}(x)] - f_{\mathbb D}(x))^2] \\ = &\ \sigma^2 + (f(x) - \mathbb E[f_{\mathbb D}(x)])^2 + \mathbb E[(\mathbb E[f_{\mathbb D}(x)] - f_{\mathbb D}(x))^2] \end{align}

Thanks Pavel.
