DLS 3 W1 Train/Dev/Test Distributions

This lecture emphasizes the need to ensure that the dev set and test set come from the same distribution. Shouldn't all three - train, dev, and test - come from the same distribution?

And given a set of examples (from which we would create the train, dev, and test sets) - are there any methods to verify that all examples come from the same distribution, and that this distribution is representative of expected future data?

Hello @dds,

Speaking of distributions, it is always a distribution of something. When it is the distribution of labels, you can plot the label distributions of the train / dev / test sets together and look at their differences. You can also use KL divergence to quantify the similarity of these 3 distributions by comparing 2 at a time.
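For example, a minimal sketch of that comparison might look like this (the label lists and class names below are just placeholders, not data from the course):

```python
import numpy as np

def label_distribution(labels, classes):
    """Relative frequency of each class, with a small epsilon to avoid log(0)."""
    counts = np.array([labels.count(c) for c in classes], dtype=float)
    return (counts + 1e-9) / (counts.sum() + 1e-9 * len(classes))

def kl_divergence(p, q):
    """KL(p || q) in nats; p and q must be proper probability vectors."""
    return float(np.sum(p * np.log(p / q)))

# Hypothetical label lists for the three splits
train_labels = ["cat", "dog", "cat", "cat", "dog", "bird"]
dev_labels   = ["cat", "dog", "bird"]
test_labels  = ["cat", "cat", "dog"]

classes = sorted(set(train_labels) | set(dev_labels) | set(test_labels))
p_train = label_distribution(train_labels, classes)
p_dev   = label_distribution(dev_labels, classes)
p_test  = label_distribution(test_labels, classes)

print("KL(dev || test):  ", kl_divergence(p_dev, p_test))
print("KL(train || dev): ", kl_divergence(p_train, p_dev))
```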

However, besides labels, there are many other things whose distributions are hard to measure. Let me quote from the video:

But maybe your users are uploading, you know, blurrier, lower res images just taken with a cell phone camera in a more casual condition. And so these two distributions of data may be different. The rule of thumb I’d encourage you to follow, in this case, is to make sure that the dev and test sets come from the same distribution.

Blurriness is one such thing whose distribution is hard to measure, unless you know what quantity best represents it.

You need to define a set of quantities first, then you may compare their distributions.
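For instance, if blurriness is what matters, one commonly used proxy (this is just an illustrative assumption, not something the lecture prescribes) is the variance of the image Laplacian - sharp images tend to have higher variance. A minimal sketch:

```python
import numpy as np
from scipy.ndimage import laplace

def blurriness_score(image):
    """Variance of the Laplacian of a grayscale image (low variance ~ blurry)."""
    return float(laplace(image.astype(np.float64)).var())

# Hypothetical batches of grayscale images, shape (num_images, height, width)
dev_images  = np.random.rand(100, 64, 64)
test_images = np.random.rand(100, 64, 64)

dev_scores  = [blurriness_score(img) for img in dev_images]
test_scores = [blurriness_score(img) for img in test_images]

# Once blurriness is a single number per image, its distribution can be
# compared across splits (histograms, means, KL divergence, etc.).
print("dev  blurriness mean: %.4f" % np.mean(dev_scores))
print("test blurriness mean: %.4f" % np.mean(test_scores))
```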

The future is full of unknowns, unless you can constrain the environment your model works in. For example, can you limit your users to supplying only images that are clear, upright, of good resolution, and so on? Sometimes that is not practical at all.

The splits are more likely to share the same distribution if you split your dataset in a randomized manner and your dataset is very big. If, on the other hand, you had sorted your data somehow before splitting, the splits are unlikely to share the same distribution.
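A minimal sketch of such a randomized split (the 98/1/1 ratio is just an example for a large dataset):

```python
import numpy as np

def random_split(num_examples, dev_frac=0.01, test_frac=0.01, seed=0):
    """Shuffle indices before splitting so all three sets come from the same distribution."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(num_examples)
    n_dev = int(num_examples * dev_frac)
    n_test = int(num_examples * test_frac)
    dev_idx = idx[:n_dev]
    test_idx = idx[n_dev:n_dev + n_test]
    train_idx = idx[n_dev + n_test:]
    return train_idx, dev_idx, test_idx

train_idx, dev_idx, test_idx = random_split(1_000_000)
print(len(train_idx), len(dev_idx), len(test_idx))  # 980000 10000 10000
```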

Cheers,
Raymond

Here in Course 3, Prof Ng is dealing with the kinds of sophisticated situations you run into in large scale real world projects. He specifically discusses the case that the training data may be from a different distribution than the test and dev data. It’s been a while since I listened to these lectures, but I think he even explains why and how that sort of situation would arise. This is the section where he also talks about subdividing the training data into the training set and the “training-dev set”. If you missed that, you should listen to those lectures again. Or maybe you haven’t hit that case yet, so in that case “stay tuned” for more explanations from Prof Ng about how and why that sort of situation can arise.

As I understand it, there are three example sets: training, dev, and test. Ideally, all three should be from the same distribution (population), otherwise the basic underpinnings of statistical learning would not hold.

Prof. Ng does explain why the test set could differ from the dev set; specifically, where training and dev set examples are sampled from high-res web images but test set images come from lower-res mobile phones. The suggested remedy is to change the evaluation metric and/or make the dev set representative of the test set.
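As a hedged illustration of the "change the evaluation metric" part (the weighting scheme below is purely an assumption for the sketch, not the lecture's exact formula), one can weight mistakes more heavily on the examples that matter most, e.g. mobile-phone images:

```python
import numpy as np

def weighted_error(y_true, y_pred, is_mobile, mobile_weight=10.0):
    """Classification error where mistakes on mobile-phone images count more."""
    weights = np.where(is_mobile, mobile_weight, 1.0)
    mistakes = (y_true != y_pred).astype(float)
    return float((weights * mistakes).sum() / weights.sum())

# Hypothetical dev-set labels, predictions, and a flag for mobile-phone images
y_true    = np.array([1, 0, 1, 1, 0, 1])
y_pred    = np.array([1, 0, 0, 1, 1, 1])
is_mobile = np.array([False, False, True, True, False, False])

print(weighted_error(y_true, y_pred, is_mobile))
```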

There are two scenarios of concern:

  • Suppose we made the dev set statistically similar to the test set. In practice, the real world could still be quite different from the test set we use. There is enormous diversity in mobile phone camera qualities and in the real-world situations where images are captured (from a foggy, low-visibility day to a cat beauty pageant). In practice we would rarely be able to control the external environment, and our trained models will not be reliable. Rather than keep tuning the model, maybe we should include some mechanism for the model to inform us that the examples being shown are outside the scope of its training (see the sketch after this list).

  • The second situation: suppose we did make the dev set and test set distributions the same. Would we then not also want to make the training set distribution the same as the test set (the test set being a proxy for the real world)? If so, we may not have enough training examples to work with. We often see start-ups struggle with this problem; they choose to train on whatever data is available and propose applying the model to situations where it does not apply (one example is a model trained on high-res radiographs being used with a lower-cost, low-res device, assuming that AI/ML/DL will bridge the gap).
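One simple (and admittedly naive) sketch of such an "out of scope" mechanism is to flag inputs on which the model's softmax confidence falls below a threshold; this is purely an illustrative assumption, not something the lecture prescribes:

```python
import numpy as np

def flag_out_of_scope(probabilities, threshold=0.6):
    """Flag examples whose highest predicted class probability is below a threshold."""
    confidence = probabilities.max(axis=1)
    return confidence < threshold

# Hypothetical softmax outputs for 4 examples over 3 classes
probs = np.array([
    [0.95, 0.03, 0.02],   # confident -> keep
    [0.40, 0.35, 0.25],   # uncertain -> flag for review
    [0.70, 0.20, 0.10],
    [0.34, 0.33, 0.33],   # almost uniform -> flag
])
print(flag_out_of_scope(probs))  # [False  True False  True]
```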

Forgot to mention - the low-res, low-cost device is yet to be created, but AI/ML/DL is used as justification for the business case - essentially arguing that the loss of resolution in the low-cost scanning device can be overcome by AI/ML/DL.

I agree! Statistical learning isn’t robust to cases where samples are in short supply. I think it is easy to generate low-res photos from high-res ones, but there are other cases where we can’t create data from nothing.
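For the "generate low-res photos from high-res ones" part, a crude sketch of simulating a low-res capture (nearest-neighbor subsampling only; a real pipeline would use proper blurring and resampling):

```python
import numpy as np

def simulate_low_res(image, scale=4):
    """Crudely simulate a low-res capture: subsample pixels, then blow them back up."""
    small = image[::scale, ::scale]                                        # downsample
    degraded = np.repeat(np.repeat(small, scale, axis=0), scale, axis=1)   # upsample
    return degraded[:image.shape[0], :image.shape[1]]                      # crop to original size

# Hypothetical high-res grayscale image
high_res = np.random.rand(256, 256)
low_res = simulate_low_res(high_res)
print(high_res.shape, low_res.shape)  # (256, 256) (256, 256)
```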

This one is interesting; including mechanisms founded on knowledge generalized from human wisdom may be a way out!