A model with high variance and bias, how?

During the course, specifically where Andrew describes the bias vs. variance trade-off, he mentions an example where a model could have high variance as well as high bias (where both training and test set errors are high).

He later adds to this and makes a figure in which a model is partly too flexible and partly NOT flexible at all.
While this is theoretically possible, I wonder whether it is practically possible.
A neural network is either too big (based on the number of parameters), and therefore has too much variance, or too small, and has too much bias.

The only way I can imagine High Bias + High Variance happening (i.e., where both training and test errors are high) is if the distributions of the training and test sets are not the same while the model is also suffering from high bias. In that case, I can see how the test set error could be higher than the training set error (which is already high).

But if the training and test set distributions are the same, and the optimal error is close to zero, then a high-bias model's test error should not be far from its training error.

Would you please help me find what I am missing?

Hey @Amin.A,
Welcome to the community. A thought-provoking question indeed. Let me try to answer it step by step.

The number of layers is not the only aspect of a neural network that determines bias and/or variance. For instance, you can have a very large neural network and use linear activations in the majority of its layers; your model will then essentially act like a linear model and will perform poorly on the majority of the training set, i.e., high bias. Also, let’s say that there are some examples in the training set which can be easily classified by a linear model, due to the features being linearly correlated with the labels; the model may then assign large significance to this handful of examples, which could result in a boundary like the one shown.
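To see why a mostly-linear network behaves like a linear model, here is a tiny NumPy sketch (my own illustration, not from the course) showing that stacked layers with linear activations collapse into a single linear map:

```python
import numpy as np

rng = np.random.default_rng(0)

# Three layers with linear (identity) activations, no biases for brevity.
W1 = rng.normal(size=(4, 3))
W2 = rng.normal(size=(4, 4))
W3 = rng.normal(size=(1, 4))

x = rng.normal(size=(3, 5))  # 5 examples with 3 features each

# Forward pass of the "deep" network: every layer is just a matrix product.
deep_output = W3 @ (W2 @ (W1 @ x))

# The equivalent single linear layer: W = W3 W2 W1.
collapsed_output = (W3 @ W2 @ W1) @ x

print(np.allclose(deep_output, collapsed_output))  # True
```

No matter how many such layers you stack, the model's capacity stays that of one linear layer.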

Also, as you very well pointed out, if the distributions of the training and test sets differ, the model may perform even worse on the test set, which could result in high variance as well.

Now, this is even more interesting. Even if the training and test set distributions are the same, I suppose a high-bias model could have high variance as well. It really depends on what we classify as “high”. Let’s say that the optimal (Bayes) error is 0%, the train set error is 5%, and the dev set error is 10%.

In this case, we can easily say that the model has high bias as well as high variance, and a difference of 5% between the errors is perfectly plausible despite the training and test sets having the same distribution. My point being, it really depends on the dataset and the model. When we say that the training and test set distributions are the same, it doesn’t mean that they are exactly the same, and they can still show a considerable difference in errors, indicating high variance. Let me know if this helps.
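If it helps, here is a hypothetical little helper (the threshold is an arbitrary choice of mine) that reads the two gaps the way Andrew does in the course:

```python
def diagnose(bayes_err, train_err, dev_err, threshold=0.03):
    """Rough bias/variance read-out from error rates (as fractions)."""
    avoidable_bias = train_err - bayes_err  # gap between train error and optimal error
    variance = dev_err - train_err          # gap between dev error and train error
    return {
        "high_bias": avoidable_bias > threshold,
        "high_variance": variance > threshold,
    }

# The example above: Bayes 0%, train 5%, dev 10%.
print(diagnose(0.00, 0.05, 0.10))
# {'high_bias': True, 'high_variance': True}
```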

Cheers,
Elemento

Very nice and clear explanation, Elemento, thank you very much!

I fully agree that the definitions of “high/low” are subjective (your last point). But for the sake of argument, let’s discuss cases where high/low means obviously and clearly high/low.

Also, I agree that saying

large #parameters is the only way to increase the complexity of the model

was not fully correct; I indeed phrased my statement poorly!

Furthermore, I am aware that we are discussing similar distributions for the training and test sets, not using the same set for both. We indeed assume that they do not “behave” differently in the high-dimensional space.

Overall, and thanks to your explanation, I am nearly convinced. So once again, thanks!
However, I think I need a bit more discussion on the following part:

let’s say that there are some examples in the training set which can be easily classified by a linear model

Sure, if most of the samples in the dataset are linearly separable, then the linear model in your example is actually a good model, and the training and test sets would both produce low errors. So we have low bias, low variance.
On the other hand, if most of the samples are not linearly separable, then the training and test set errors would both remain high, and probably at similar levels. So we have high bias, low variance. The same argument holds for a case where the ratio of linearly separable samples is 50/50.

I can also imagine a dataset where part of the high-dimensional space is linearly separable and the other part is not. In that case, we still end up with similar errors for the training and test sets.

To rephrase and summarize: if the number of linearly separable samples is large, then the model would learn this pattern and work well. And if the number of linearly separable samples is small, then the calculated error would not be influenced by these few samples, and we would still see large training and test errors (so a bias problem, and no variance problem).
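For what it's worth, here is a quick synthetic check of the "mostly not linearly separable" case (a sketch with an XOR-like toy dataset of my own choosing, using scikit-learn):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2000

# XOR-like labels: positive in quadrants I and III, so the pattern
# is NOT linearly separable for the bulk of the data.
X = rng.normal(size=(n, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

model = LogisticRegression().fit(X_tr, y_tr)
print("train error:", 1 - model.score(X_tr, y_tr))
print("test error: ", 1 - model.score(X_te, y_te))
# Both errors come out high (~50%) and at similar levels:
# high bias, low variance, just as argued above.
```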

Does my argument make sense?

Hey @Amin.A,

Your argument indeed makes complete sense to me, since I was thinking something like this too when writing my reply, and I couldn’t find a counter-example then. But what about this!

Consider that the training set and test set distributions are quite similar, and let’s say that the train set has 80% non-linearly separable samples and 20% linearly separable samples, while the test set has 85% non-linearly separable samples and 15% linearly separable samples. We train a large neural network with the majority of its activations linear (as described earlier) on the train set and validate it on the test set. Now, consider the following errors:

  • Optimal Error as 0%
  • Train Error as 20% (which is expected, since the model will perform poorly on the train set due to the majority of non-linearly separable examples)
  • Test Error as 35% (5% due to the extra 5% non-linearly separable samples, and 10% possibly due to random noise present in the samples, which doesn’t affect the distribution much)
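Reading these numbers through the usual gaps (train error minus Bayes error for avoidable bias, test error minus train error for variance):

\text{avoidable bias} = 20\% - 0\% = 20\%, \qquad \text{variance} = 35\% - 20\% = 15\%

Both gaps are large.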

Now, in my opinion, we can refer to this as a high-bias and high-variance scenario. However, let me mention here very clearly that I am once again slightly deviating from the hypothesis, since the hypothesis that the train set and the test set have the exact same distribution is hardly ever found in any practical scenario.

However, if we assume that the train set and test set have exactly the same distribution, with no noise present whatsoever, and if a neural network has high bias (which indicates a large error on the train set), then the network should have pretty much the same error on the test set as well (indicating that in the current iteration, the model has high bias + low variance).

Still allow me to mention @anon57530071 here. Hey Nobu, can you please take a look and see if we are missing anything? Thanks in advance.

Cheers,
Elemento

Interesting discussion.

In the case of the deep learning bias/variance discussion, confusion may come from the way we evaluate, i.e., we use two sample sets, a training set and a test set; it is not an evaluation on a single sample set.

So, first, we need to define a training set as a “condition”. It is usually defined as follows.

D = \{(x_1,y_1), (x_2,y_2),...(x_n,y_n)\}

And the evaluation is done on the other sample set, using a model \hat{f}(x) trained on D, which should be explicitly written as \hat{f}(x;D).

Then, a loss function can be written as follows.

\mathcal L = (y - \hat{y})^2 = (y - \hat{f}(x;D))^2

Let’s calculate the expectation of the loss.

\begin{aligned}
E_D[\mathcal L] &= E_D[(y - \hat{f}(x;D))^2] \\
&= E_D[(\hat{f}(x;D) - E[y|x;D])^2] + E_D[(E[y|x;D] - y)^2] \\
&\ \ \vdots \\
&= E_D[(\hat{f}(x;D) - E_D[\hat{f}(x;D)])^2] + E_D[(E_D[\hat{f}(x;D)] - E[y|x;D])^2] \\
&\quad + E_D[(E[y|x;D] - y)^2]
\end{aligned}

I skipped the intermediate transformations (showing that the cross terms vanish), since they are tedious to write in LaTeX… :disappointed_relieved:
The important thing is the last equation.

  • E_D[(\hat{f}(x;D) - E_D[\hat{f}(x;D)])^2] is the “Variance” of the model on the 2nd sample set.
  • E_D[(E_D[\hat{f}(x;D)] - E[y|x;D])^2] is the “Bias”(**2) of the model on the 2nd sample set.
  • E_D[(E[y|x;D] - y)^2] is not related to the model, since \hat{f}(x;D) does not appear in it.

This third term should include the difference in distributions between the two sample sets. In addition, it includes the intrinsic noise, which can be modeled as \mathcal N(0,\sigma^2).

In this sense,

Also, as you very well pointed out, if the distributions of the training and test sets differ, the model may perform even worse on the test set, which could result in high variance as well.

This may not be true… A model evaluation is simply done with the first two terms, Variance and Bias, and is independent of the differences in sample distributions. That’s what I thought… maybe wrong…

And the meaning of each term is:

  • Variance: how much the model output \hat{f}(x;D) varies around its expectation as the sample set D changes.
  • Bias: how far the expectation of the model output is from the expectation of the “real” output, E[y|x;D], which would minimize the loss.

So, if a model is “overfitted”, then the Variance is most likely high, i.e., high variance. And if a model does not represent the truth well, then the Bias should be high, i.e., high bias.
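To make the role of D concrete, here is a small Monte Carlo sketch (my own construction, not from the lecture) that re-draws the sample set D many times, fits a polynomial, and estimates the Variance and Bias(**2) terms above at a fixed point x:

```python
import numpy as np

rng = np.random.default_rng(0)

def true_f(x):
    # E[y|x], the "real" regression function in this toy setup
    return np.sin(2 * np.pi * x)

def predictions_at(x0, degree, n=30, sigma=0.3, trials=2000):
    """Re-draw the sample set D `trials` times, fit a polynomial, predict at x0."""
    preds = np.empty(trials)
    for t in range(trials):
        x = rng.uniform(0, 1, n)                  # a fresh sample set D
        y = true_f(x) + rng.normal(0, sigma, n)   # noisy targets
        coeffs = np.polyfit(x, y, degree)
        preds[t] = np.polyval(coeffs, x0)
    return preds

x0 = 0.25
for degree in (1, 9):  # underfitting vs. overfitting model
    preds = predictions_at(x0, degree)
    variance = preds.var()                        # E_D[(f_hat - E_D[f_hat])^2]
    bias_sq = (preds.mean() - true_f(x0)) ** 2    # (E_D[f_hat] - E[y|x])^2
    print(f"degree={degree}  variance={variance:.4f}  bias^2={bias_sq:.4f}")
# degree=1 gives large bias^2 and small variance; degree=9 the opposite.
```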

Note that the above equations may need to be revisited… But basically, the relationships among “variance”, “bias”, and “noise” are the same…

One thing that I may want to add is… If we refer to classical statistics, we usually need to think about samples vs. population. The training set and test set are just two small sample sets from the large population. So even if we are OK with both the training and test sets, it does not mean the model works in the real environment… Lots of challenges…

Hi Nobu,

Thank you for taking the time to write this explanatory note.
So far, I am unable to follow your line of reasoning. There are many sections which I cannot follow, most likely due to my lack of knowledge; most notably, the expectation formulations.

Also, in the text I cannot fully comprehend what you mean in several places. For example:

What are the “first one” and “second one” here?

Anyway, I will try to read around this, and if I succeed, I will post my understanding here. Hopefully it will be useful for someone in the future.

Best of luck!

What are the “first one” and “second one” here?

Sorry for the confusion. They are the first two terms, i.e., Variance and Bias, of course.

  • E_D[(\hat{f}(x;D) - E_D[\hat{f}(x;D)])^2] is the “Variance” of the model on the 2nd sample set.
  • E_D[(E_D[\hat{f}(x;D)] - E[y|x;D])^2] is the “Bias”(**2) of the model on the 2nd sample set.

And the 3rd term represents errors which are not caused by the model, but by some conditions of the 2nd sample set. Those losses may be caused by differences in distributions, errors in collecting data, and so on.
So, there should be some impact on the results if the data distribution in the 2nd sample set is different from the 1st sample set, but it is not part of the model evaluation. That’s my view.