Using cross validation when train & dev come from different distributions?

Hi! In Course 3, Dr. Ng talks about cases where Train can come from a different distribution than Dev & Test. Is there a way to have mismatched distributions and still do cross-validation? Or is my hunch correct that when Train and Dev are sampled from different distributions, it’s best to skip CV and just do a single Train / Dev / Test split?

HOW I’M THINKING ABOUT IT
My case is a multilabel classifier with 3 classes. I want to use different Train and Dev/Test sets, because I can enrich Dev/Test with some better real-world data, give Train a small slice of that too, but fill Train mostly with other data that’s relevant but slightly less representative of the main use case (a rough sketch is below). I don’t have a ton of labeled data, ~20K samples. Maybe with 3 classes and a classical ML model like RandomForest that’s plenty of data (?), but I figure CV never hurts except for taking more time, so use CV if I can?
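For concreteness, here’s a rough sketch of the kind of split I mean, using dummy arrays in place of my real data (every name, size, and number here is made up just so the snippet runs; it’s not my actual pipeline):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Dummy stand-ins for my two data sources (shapes and sizes invented just so this runs)
rng = np.random.default_rng(0)
real_world_X = rng.normal(size=(4_000, 20))         # the "better" real-world data
real_world_y = rng.integers(0, 2, size=(4_000, 3))  # 3 binary labels (multilabel)
other_X = rng.normal(size=(16_000, 20))             # relevant but less representative data
other_y = rng.integers(0, 2, size=(16_000, 3))

# Hold back most of the real-world data for Dev/Test, and give a small slice of it to Train
rw_train_X, rw_eval_X, rw_train_y, rw_eval_y = train_test_split(
    real_world_X, real_world_y, test_size=0.8, random_state=0)
dev_X, test_X, dev_y, test_y = train_test_split(
    rw_eval_X, rw_eval_y, test_size=0.5, random_state=0)

# Train = mostly the "other" data, plus the small real-world slice
train_X = np.concatenate([other_X, rw_train_X])
train_y = np.concatenate([other_y, rw_train_y])
```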

With a typical use of CV, I’d split all the labeled data into Train / Test, make say 10 folds, and in each fold sample training and evaluation subsets from Train, something like the sketch below. So some data in the training subset of one fold could be in the evaluation subset of another fold.
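In code, I picture that typical setup as something like this (again just a sketch with made-up data and sklearn defaults, not a recommendation):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, train_test_split

# Dummy pool of labeled data; in the typical case it all comes from one distribution
rng = np.random.default_rng(0)
X = rng.normal(size=(20_000, 20))
y = rng.integers(0, 2, size=(20_000, 3))

# Hold out Test once, then cross-validate within what's left
pool_X, holdout_test_X, pool_y, holdout_test_y = train_test_split(
    X, y, test_size=0.2, random_state=0)

kf = KFold(n_splits=10, shuffle=True, random_state=0)
for fit_idx, eval_idx in kf.split(pool_X):
    model = RandomForestClassifier(n_estimators=50, random_state=0)
    model.fit(pool_X[fit_idx], pool_y[fit_idx])
    print(model.score(pool_X[eval_idx], pool_y[eval_idx]))  # this fold's "dev" score
```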

But now Train and Dev need to come from different distributions. I need to keep them separate. Per Dr. Ng’s course, I assume I only want to estimate variance in each fold with the Dev data (to better hit the bullseye of the real-world use case).

How would I set up the data in each fold so that I’m only evaluating the fit on Dev AND cross-validation is doing something meaningful?

My only idea is to evaluate each fold’s fit on a subset of Dev (sketched in code after the list below), but it doesn’t feel productive:

  • Before CV: Train | Dev | Test
  • CV Fold 1: fit the model to all of Train + 4/5 of Dev | evaluate on the remaining 1/5 of Dev
  • CV Folds 2-5: resample, changing which 4/5 of Dev gets added to the fit and which 1/5 I use to estimate variance
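In code, reusing the dummy `train_X` / `dev_X` arrays from my first sketch above, I mean something like this (a sketch of the idea only, not a claim that it’s a sound procedure):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold

# Only Dev gets folded; all of Train goes into every fit,
# and evaluation only ever happens on the held-out slice of Dev
kf = KFold(n_splits=5, shuffle=True, random_state=0)
for dev_fit_idx, dev_eval_idx in kf.split(dev_X):
    fit_X = np.concatenate([train_X, dev_X[dev_fit_idx]])  # all of Train + 4/5 of Dev
    fit_y = np.concatenate([train_y, dev_y[dev_fit_idx]])
    model = RandomForestClassifier(n_estimators=50, random_state=0)
    model.fit(fit_X, fit_y)
    print(model.score(dev_X[dev_eval_idx], dev_y[dev_eval_idx]))  # score on 1/5 of Dev
```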

So all of Train always stays in the fits. The only thing that changes from fold to fold is the whipped-cream topping: a different 1/5 of Dev gets added on top of all of Train. How much practical difference does swapping out that 1/5 really make? The fits would seem to be very similar in each fold… unless, I guess:
a) you have an enormous dataset (but then you don’t need CV?)
b) and a small number of folds (but then CV is less valuable?)
c) and/or you use an unusual Train/Dev split percentage in each fold (which only seems reasonable if you have a ton of data in Dev, and then would you even need different distributions for Train vs Dev/Test in the first place?).

So I’m not seeing how this helps. It feels very different in function from the typical CV situation, where each fold’s training and evaluation data are sampled from the same distribution.

Am I thinking about this right? Or is there a standard way to make CV useful when Train and Dev come from different distributions?

Thanks!!

Hi @carloshvp - I posted this question to the Course 3 forum back in September and haven’t heard any responses. Is there a way to put it back in front of the Mentors in general for assistance? Or is it a hopeless question?

PS: Hope you don’t mind me mentioning you directly. Forgive me if that was bad form. I don’t know how to @ mention the Mentor crew in general, and I saw you are a Mentor who had posted recently. :smiley: Thank you either way!

Hi Cparmet,

Here’s a thread that could help you with some information based on your query.

Also, if you would like to know more about cross-validation, dataset splits, and error analysis, you can refer to Prof. Ng’s book “Machine Learning Yearning”.

Thanks.