Interesting discussion.
In the deep-learning bias/variance discussion, confusion may come from how the evaluation is done: it uses two sample sets, a training set and a test set, rather than a single sample set.
So, first, we need to define a training set as a “condition”. It is usually defined as follows.
D = \{(x_1,y_1), (x_2,y_2), \dots, (x_n,y_n)\}
And, the evaluation is done on the other sample set using a model \hat{f}(x) trained on D; this dependence on D should be written explicitly as \hat{f}(x;D).
Then, a loss function can be written as follows.
\mathcal L = (y - \hat{y})^2 = (y - \hat{f}(x;D))^2
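As a minimal sketch of this setup (a hypothetical 1-D regression toy with numpy; `true_f`, `draw_D`, and `fit_model` are names I made up for illustration), the training set D, the D-dependent model \hat{f}(x;D), and the squared loss might look like:

```python
import numpy as np

rng = np.random.default_rng(0)

def true_f(x):
    # the "real" expected output E[y|x]
    return np.sin(x)

def draw_D(n=30, sigma=0.3):
    # D = {(x_1, y_1), ..., (x_n, y_n)}, with y = E[y|x] + N(0, sigma^2) noise
    x = rng.uniform(0.0, np.pi, n)
    y = true_f(x) + rng.normal(0.0, sigma, n)
    return x, y

def fit_model(D, degree=3):
    # \hat{f}(x; D): a model whose parameters depend on the training set D
    x, y = D
    coeffs = np.polyfit(x, y, degree)
    return lambda x_new: np.polyval(coeffs, x_new)

D = draw_D()
f_hat = fit_model(D)

# squared loss on a fresh test point (x, y) drawn outside of D
x_test = 1.5
y_test = true_f(x_test) + rng.normal(0.0, 0.3)
loss = (y_test - f_hat(x_test)) ** 2
```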
Let’s calculate the expectation of loss.
\begin{aligned}
E_D[\mathcal L] &= E_D[(y - \hat{f}(x;D))^2] = E_D[(\hat{f}(x;D)-E[y|x;D])^2 + (E[y|x;D] - y)^2]\\
&= E_D[(\hat{f}(x;D)-E[y|x;D])^2] + E_D[(E[y|x;D] - y)^2]\\
&\ \ \vdots\\
&= E_D[(\hat{f}(x;D) - E_D[\hat{f}(x;D)])^2] + E_D[(E_D[\hat{f}(x;D)] - E[y|x;D])^2] \\
&\ \ \ \ + E_D[(E[y|x;D] - y)^2]
\end{aligned}
I skipped most of the transformations since they are tedious to write in LaTeX…
The important thing is the last equation.
- E_D[(\hat{f}(x;D) - E_D[\hat{f}(x;D)])^2] is the “Variance” of a model on the 2nd sample set.
- E_D[(E_D[\hat{f}(x;D)] - E[y|x;D])^2] is the squared “Bias” of a model on the 2nd sample set.
- E_D[(E[y|x;D] - y)^2] is not related to the model, since the model output \hat{f}(x;D) does not appear in it.
The third term should include the difference of distributions between the two sample sets. In addition, it should include the noise, defined as \mathcal N(0,\sigma^2).
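To make the noise part explicit (assuming the usual additive-noise model for the labels): if we write y = E[y|x;D] + \epsilon with \epsilon \sim \mathcal N(0,\sigma^2), the third term reduces to the irreducible noise,

E_D[(E[y|x;D] - y)^2] = E[\epsilon^2] = \sigma^2

so it puts a floor on the loss that no model can remove.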
In this sense, regarding the point you raised, that “if the distributions of training and test sets differ, the model may perform even worse on the test set, and could result in high variance as well”: this may not be true… A model evaluation is simply done with the first two terms, the variance and the squared bias, and those are independent of the differences between the sample distributions; the distribution gap enters only through the third term. That's what I thought… maybe wrong…
And, the meaning of each is:
- Variance : how much the model output \hat{f}(x;D) fluctuates around its own expectation E_D[\hat{f}(x;D)] as the sample set D changes
- Bias : how far the expected model output E_D[\hat{f}(x;D)] is from the expectation of the “real” output, E[y|x;D], which is the prediction that would minimize the loss
So, if a model is “overfitted”, the Variance is most likely high, i.e., high-variance. And if a model does not represent the truth well, the Bias should be high, i.e., high-bias.
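This can be illustrated numerically; a sketch under assumptions of my own (toy problem y = sin(x) + noise, polynomial models of different capacities, all parameter choices hypothetical): fitting many training sets D and comparing an over-simple model against an over-flexible one.

```python
import numpy as np

rng = np.random.default_rng(1)
sigma = 0.3
xs = np.linspace(0.2, np.pi - 0.2, 9)   # test inputs; truth is E[y|x] = sin(x)

def variance_and_bias_sq(degree, trials=2000, n=20):
    # fit a polynomial \hat{f}(.; D) of the given capacity on many training sets D
    preds = np.empty((trials, xs.size))
    for t in range(trials):
        x = rng.uniform(0.0, np.pi, n)
        y = np.sin(x) + rng.normal(0.0, sigma, n)
        preds[t] = np.polyval(np.polyfit(x, y, degree), xs)
    variance = preds.var(axis=0).mean()                        # spread across D
    bias_sq = ((preds.mean(axis=0) - np.sin(xs)) ** 2).mean()  # distance to truth
    return variance, bias_sq

var_lo, bias2_lo = variance_and_bias_sq(degree=0)   # too simple: underfits
var_hi, bias2_hi = variance_and_bias_sq(degree=7)   # very flexible: overfits
# expect: var_hi > var_lo (high-variance) and bias2_lo > bias2_hi (high-bias)
```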
Note that the above equations may need to be revisited… But basically, the relationship among “variance”, “bias”, and “noise” is the same…
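That relationship can be sanity-checked with a Monte Carlo sketch (again a hypothetical toy, y = sin(x) + noise with a quadratic \hat{f}; all choices are mine): draw many training sets D, fit on each, and compare the average loss at a fixed test point against variance + bias^2 + noise.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.3        # label noise std: y = sin(x) + N(0, sigma^2)
x0 = 1.0           # fixed test input; E[y|x0] = sin(x0)

preds, losses = [], []
for _ in range(20000):
    # draw a fresh training set D and fit \hat{f}(.; D) (here: a quadratic)
    x = rng.uniform(0.0, np.pi, 30)
    y = np.sin(x) + rng.normal(0.0, sigma, 30)
    y_hat = np.polyval(np.polyfit(x, y, 2), x0)
    preds.append(y_hat)
    # fresh test label at x0, then the squared loss for this D
    y_test = np.sin(x0) + rng.normal(0.0, sigma)
    losses.append((y_test - y_hat) ** 2)

preds = np.asarray(preds)
expected_loss = np.mean(losses)                 # E_D[L]
variance = np.var(preds)                        # E_D[(fhat - E_D[fhat])^2]
bias_sq = (np.mean(preds) - np.sin(x0)) ** 2    # (E_D[fhat] - E[y|x])^2
noise = sigma ** 2                              # E[(y - E[y|x])^2]
# expected_loss should be close to variance + bias_sq + noise
```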
One thing I may want to add is… If we refer to classical statistics, we usually need to think about samples vs. the population. The training set and the test set are just two small sample sets drawn from the large population. So even if we are OK with both the training and test sets, that does not mean the model works in the real environment… Lots of challenges…