The consequence of different distributions in train, dev, and test sets

I have two questions on this topic.

  1. Why is it not okay to have different data distributions?
    Andrew used the analogy that the dev set is the target we aim at using the train set, and the test set is the actual target we want to hit, so if the two targets are not the same, we would spend a lot of resources aiming at the wrong target.
    But this analogy confuses me. I thought we train models to learn the “mapping” of the data, not the distribution of the data, so the target should be the mapping function, not the distribution. If the mapping is the same between the dev and test sets, the target should also be the same.
    For example, suppose the true mapping between input x and output y is y = x^2. For the train and dev sets, x is sampled from a normal distribution, and for the test set x is sampled from an exponential distribution. The distribution has changed, but as long as the model has learned the mapping y = x^2, it should work perfectly well on the test set!

  2. From the course I learned that it is okay if the train and dev sets have different distributions, but the dev and test sets should have the same distribution. Why is that?


Hey @Sara,

I love your questions :slight_smile:
Fortunately, these two are easy to explain.

We train our model on the training set and evaluate it on the dev and test sets. In a sense, the purpose of the test set is to make sure that our evaluation on the dev set is correct (we expect the dev and test errors to be close to each other).

The purpose of the dev and test sets is to perform evaluation on them. If they had different distributions, we wouldn’t be able to compare their errors and reason about the results.

The true data distribution is unknown. We only have access to an empirical data distribution – our training set.

  • If we evaluate our optimization of the model parameters using data that came from the same distribution as our training set, we will be doing our best (we come close to aiming at the true target).
  • If we evaluate our optimization using data that came from a different distribution, we may still get some decent results, but they will be worse than with the previous option (we aim further from the true target).
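Your y = x^2 example is actually a nice way to see where this breaks down in practice: a flexible model typically memorizes the region where it saw training data rather than recovering the formula itself, so it fails exactly where the shifted distribution puts mass the training set never covered. Here is a minimal sketch of that idea (my own toy setup, assuming a 1-nearest-neighbor regressor built with NumPy; the exact numbers are illustrative, not from the course):

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # The true mapping from your example: y = x^2 (same for every split).
    return x ** 2

# Train on x ~ N(0, 1). A 1-nearest-neighbor regressor simply
# memorizes the (x, y) pairs it has seen.
x_train = rng.normal(size=1000)
y_train = f(x_train)

def predict_1nn(x_query):
    # For each query point, return the y of the closest training x.
    idx = np.abs(x_query[:, None] - x_train[None, :]).argmin(axis=1)
    return y_train[idx]

def mse(x_eval):
    return float(np.mean((predict_1nn(x_eval) - f(x_eval)) ** 2))

# Dev-like set: same distribution as training. Test-like set: shifted
# to an exponential, which puts mass beyond the training support.
mse_same = mse(rng.normal(size=500))
mse_shift = mse(rng.exponential(size=500))

print("MSE, same distribution:   ", mse_same)
print("MSE, shifted distribution:", mse_shift)
```

Even though the mapping y = x^2 is identical everywhere, the model only knows it where the normal training samples landed (roughly x in [-3, 3]), so the error on the exponential set is much larger. That is the sense in which the “target” you aim at is tied to the distribution, not just the mapping.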