These questions are all pretty subtle and require some careful thought. In this example, the training distribution includes both the forward-facing data (from the same distribution as the dev/test data) and the Internet data (a different distribution). But the model does worse (higher error rate) on the full training distribution than it does on just the dev/test distribution. Usually you expect the model to do better on the training data than on the dev/test data (at least mild overfitting) or, in the ideal case, to do equally well on both. So if it does better on the dev/test data, that suggests the dev/test data is easier for the trained algorithm to classify correctly. At the very least, you would conclude that the Bayes error is most likely equal to or lower on the dev/test data than it is on the full training distribution.
It doesn’t prove that, of course, but given the behavior posed in the question, it is unlikely that the dev/test data is “harder” (has a higher Bayes error) than the full training distribution.
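To make the point that Bayes error is a property of a particular data distribution rather than one constant for all data, here is a minimal toy sketch. It is not the problem from the question: it assumes a made-up 1-D task with two equally likely classes drawn from N(-mu, 1) and N(+mu, 1), where the best possible classifier thresholds at 0 and its error rate is Phi(-mu). The names `bayes_error`, `easy_source` and `hard_source` are hypothetical and only stand in for an "easier" and a "harder" data source.

```python
import math


def normal_cdf(x: float) -> float:
    """Standard normal CDF, Phi(x), written with the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def bayes_error(mu: float) -> float:
    """Lowest achievable error for a toy task (an assumption, not the quiz data).

    Two equally likely classes with x ~ N(-mu, 1) and x ~ N(+mu, 1).
    The optimal rule predicts by the sign of x, and it is wrong with
    probability Phi(-mu), whichever class the example came from.
    """
    return normal_cdf(-mu)


# Hypothetical "easy" source: classes far apart, so little overlap.
easy_source = bayes_error(2.0)

# Hypothetical "hard" source: classes close together, so heavy overlap.
hard_source = bayes_error(0.5)

print(f"Bayes error, easy source: {easy_source:.4f}")
print(f"Bayes error, hard source: {hard_source:.4f}")
```

Even a perfect model cannot go below these floors, and the floor is different for each source. That is why error rates measured on splits drawn from different distributions can hint at which distribution is intrinsically easier.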
The semantics are pretty subtle here. If what I said above does not convince you, then it might be worth listening to the relevant lectures again.
OK, thanks. I just didn’t quite understand how we can draw conclusions about Bayes error from the errors on different splits. I thought Bayes error was something like a constant for all data. But I’ve understood from your answer that we can approximate it because the dev and test data came from a different source.