Bias and variance

Week 3 - Diagnosing Bias and Variance at timestamp 9:20

I have checked Week 3 - Diagnosing Bias and Variance at 9:20 and I can now see a plot illustrating high bias and high variance.

Thanks.

To formally express the bias-variance tradeoff in regression, we construct the following setup:

  • Draw a training dataset {\mathbb D} = \{(x^{(i)}, y^{(i)})\}_{i=1}^{|\mathbb D|}, which contains |\mathbb D| elements such that y^{(i)} = f(x^{(i)}) + \xi^{(i)}, where f(\cdot) is a ground-truth function and \xi^{(i)} \sim {\mathcal N}(0, \sigma^2) is Gaussian noise.
  • A model is trained on {\mathbb D}, denoted by f_{\mathbb D}.
  • For a test example (x, y) such that y = f(x) + \xi, where \xi \sim {\mathcal N}(0, \sigma^2), we measure the expected test error (averaged over the random draw of the training set {\mathbb D} and the randomness of \xi)
{\rm MSE}(x) = {\mathbb E}_{{\mathbb D}, \xi} [(y - f_{\mathbb D}(x))^2].

The test input x is considered to be fixed, but the same setup holds when we average over multiple examples. We can decompose {\rm MSE}(x):

\begin{align} {\rm MSE}(x) &= \mathbb E[(\xi + (f(x) - f_{\mathbb D}(x)))^2] \\ & = \mathbb E[\xi^2] + \mathbb E[(f(x) - f_{\mathbb D}(x))^2] \\ & = \sigma^2 + \mathbb E[(f(x) - f_{\mathbb D}(x))^2] \\ & = \sigma^2 + \mathbb E[(f(x) - \mathbb E[f_{\mathbb D}(x)] + \mathbb E[f_{\mathbb D}(x)] - f_{\mathbb D}(x))^2] \\ & = \sigma^2 + (f(x) - \mathbb E[f_{\mathbb D}(x)])^2 + \mathbb E[(\mathbb E[f_{\mathbb D}(x)] - f_{\mathbb D}(x))^2] \\ & = \underbrace{\sigma^2}_{\rm noise} + \underbrace{(f(x) - \mathbb E[f_{\mathbb D}(x)])^2}_{\rm bias^2} + \underbrace{{\rm Var}(f_{\mathbb D}(x))}_{\rm variance}. \end{align}

(In the second line, the cross term 2\,\mathbb E[\xi\,(f(x) - f_{\mathbb D}(x))] vanishes because \xi has zero mean and is independent of {\mathbb D}.)

Here \mathbb E[f_{\mathbb D} (x)] corresponds to training the model on infinitely many datasets and averaging their predictions at x. The bias quantifies the error introduced by the model’s inability to represent the true function f(x) and reflects the limitations of the model class (“underfitting”). The variance measures the sensitivity of the learned model to the randomness in the dataset.
Since we can’t draw an infinite number of datasets, the bootstrap method allows us to estimate \mathbb E[f_{\mathbb D} (x)] in practice. We resample datasets from the one dataset we have, using sampling with replacement. Then we train multiple models on these bootstrap-resampled datasets and use the average prediction across bootstrap models to approximate bias and variance.
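If it helps, here is a minimal sketch of the idea in Python. It is not the course notebook's code; the ground-truth function, the noise level, and the polynomial model class are all assumptions made purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy setup: ground truth f and noise level sigma.
f = lambda x: np.sin(2 * np.pi * x)
sigma = 0.3

# The one dataset D we actually have, with |D| = n points.
n = 50
x_train = rng.uniform(0, 1, size=n)
y_train = f(x_train) + rng.normal(0, sigma, size=n)

# A stand-in model class: degree-3 polynomial least squares.
def fit_model(x, y, degree=3):
    coeffs = np.polyfit(x, y, degree)
    return lambda x_new: np.polyval(coeffs, x_new)

# Bootstrap: resample n pairs with replacement, refit, and record
# each bootstrap model's prediction at a fixed test input x.
x_test = 0.5
B = 200  # number of bootstrap resamples
preds = np.empty(B)
for b in range(B):
    idx = rng.integers(0, n, size=n)  # sampling with replacement
    model_b = fit_model(x_train[idx], y_train[idx])
    preds[b] = model_b(x_test)

# The bootstrap average approximates E[f_D(x)]. In this simulation we
# know f, so we can also compute the squared bias; with real data f(x)
# is unknown and the bias can only be estimated indirectly.
avg_pred = preds.mean()
bias_sq = (f(x_test) - avg_pred) ** 2
variance = preds.var()
print(f"bias^2 ~ {bias_sq:.4f}, variance ~ {variance:.4f}")
```

Increasing B makes the bootstrap averages less noisy; the resampling with replacement is what substitutes for drawing fresh datasets from the underlying distribution.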

Thanks, but I am unfamiliar with some of your notation, as it is not used in Andrew’s course.

What do you mean by “draw a training set”?

What does the notation |\mathbb D| mean?

It’s not clear to me what the bootstrap method or the bias-variance decomposition is.

When we say “draw a training set”, we mean randomly sample a set of input–output pairs \{(x^{(i)}, y^{(i)})\} from some underlying data-generating distribution p. This constitutes one draw of a training set \mathbb{D} \sim p^{|\mathbb{D}|}, the product distribution over |\mathbb{D}| i.i.d. pairs. The notation |\mathbb{D}| means the size of the dataset \mathbb{D}, i.e. if \mathbb{D} = \{ (x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \dots, (x^{(n)}, y^{(n)}) \}, then |\mathbb{D}| = n.
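As a concrete toy illustration (the specific f, noise level, and dataset size here are assumptions for the example, not anything from the course):

```python
import numpy as np

rng = np.random.default_rng(42)

f = lambda x: np.sin(2 * np.pi * x)  # assumed ground-truth function
sigma = 0.3                          # assumed noise standard deviation
n = 100                              # dataset size, i.e. |D| = n

# One "draw" of a training set D: n i.i.d. pairs from the
# data-generating process y = f(x) + noise.
x = rng.uniform(0, 1, size=n)
y = f(x) + rng.normal(0, sigma, size=n)
D = list(zip(x, y))

print(len(D))  # prints 100, which is |D|
# Re-running with a different seed gives a *different* draw of D.
```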
Let me know if you’d like me to add an implementation of the bootstrap method to the notebook.

What is a “product distribution”?

What does “i.i.d.” mean?

Also, I still don’t know what the bootstrap method or the bias-variance decomposition is, as these terms have not been used by Andrew in his video lessons up to this point.

How do you get from:

{\rm MSE}(x) = \sigma^2 + \mathbb E[(f(x) - \mathbb E[f_{\mathbb D}(x)] + \mathbb E[f_{\mathbb D}(x)] - f_{\mathbb D}(x))^2]

to:

{\rm MSE}(x) = \sigma^2 + (f(x) - \mathbb E[f_{\mathbb D}(x)])^2 + \mathbb E[(\mathbb E[f_{\mathbb D}(x)] - f_{\mathbb D}(x))^2]

A product distribution is a way of describing multiple independent and identically distributed (i.i.d.) samples: each data point is drawn independently from the same distribution, so the distribution of the whole dataset is the product of the distributions of the individual points.

Although the bias-variance decomposition isn’t covered in this course, I think it’s useful extra material. The bootstrap method is simply a practical tool for estimating bias and variance using the data you already have.

I see.

Thank you.

Taking into account that f(x) - \mathbb E[f_{\mathbb D}(x)] is a constant (it does not depend on the draw of {\mathbb D}), we have:

\begin{align} {\rm MSE}(x) = &\ \sigma^2 + \mathbb E[(f(x) - \mathbb E[f_{\mathbb D}(x)] + \mathbb E[f_{\mathbb D}(x)] - f_{\mathbb D}(x))^2] \\ = &\ \sigma^2 + \mathbb E[(f(x) - \mathbb E[f_{\mathbb D}(x)])^2] + \\ & 2\, \mathbb E[(f(x) - \mathbb E[f_{\mathbb D}(x)])(\mathbb E[f_{\mathbb D}(x)] - f_{\mathbb D}(x))] + \mathbb E[(\mathbb E[f_{\mathbb D}(x)] - f_{\mathbb D}(x))^2] \\ = &\ \sigma^2 + (f(x) - \mathbb E[f_{\mathbb D}(x)])^2 + \\ & 2\, (f(x) - \mathbb E[f_{\mathbb D}(x)])\, \mathbb E[\mathbb E[f_{\mathbb D}(x)] - f_{\mathbb D}(x)] + \mathbb E[(\mathbb E[f_{\mathbb D}(x)] - f_{\mathbb D}(x))^2] \\ = &\ \sigma^2 + (f(x) - \mathbb E[f_{\mathbb D}(x)])^2 + \\ & 2\, (f(x) - \mathbb E[f_{\mathbb D}(x)])(\mathbb E[f_{\mathbb D}(x)] - \mathbb E[f_{\mathbb D}(x)]) + \mathbb E[(\mathbb E[f_{\mathbb D}(x)] - f_{\mathbb D}(x))^2] \\ = &\ \sigma^2 + (f(x) - \mathbb E[f_{\mathbb D}(x)])^2 + \mathbb E[(\mathbb E[f_{\mathbb D}(x)] - f_{\mathbb D}(x))^2] \end{align}

Thanks Pavel.
