V(alidation) or C(ross)V(alidation)?

So, this question is for anyone, but @paulinpaloalto if you have a moment I’d be especially curious what you think because you’ve seen MLS and DLS in their various iterations.

As I might have mentioned, I did my ML elsewhere, so I started here with just DLS and am now almost finished with NLP.

When I did ML, we actually spent quite a lot of time going over the topic of CV, or Cross Validation, in detail. I mean yes, here in DLS we almost always have a V, or Validation Set, as well, for purposes like hyperparameter tuning / model experimentation. But in the end we only seem to have one.

At first I thought this just might be an oddity about DLS, but NLP takes the exact same tack.

With CV, we’d have many validation sets, re-splitting the train set each time, etc. Then there was a whole discussion of bootstrap methods as well.
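For concreteness, here is a rough sketch of the two ideas (scikit-learn on a toy dataset, nothing from the courses; the model and numbers are just placeholders):

```python
# A minimal sketch of "many validation sets": k-fold CV re-splits the data
# k times, and the bootstrap resamples it with replacement. Toy data/model.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.utils import resample

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
model = LogisticRegression(max_iter=1000)

# k-fold: every point serves as validation data exactly once
scores = cross_val_score(model, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=0))
print("5-fold scores:", np.round(scores, 3), "mean:", scores.mean())

# bootstrap: resample the training set with replacement, validate on the rest
rng = np.random.RandomState(0)
boot_scores = []
for _ in range(10):
    idx = resample(np.arange(len(X)), replace=True, random_state=rng)
    oob = np.setdiff1d(np.arange(len(X)), idx)  # out-of-bag points not drawn
    boot_scores.append(model.fit(X[idx], y[idx]).score(X[oob], y[oob]))
print("bootstrap (out-of-bag) scores:", np.round(boot_scores, 3))
```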

So I’m a little confused now: why does that thinking seem to have ‘disappeared’ from the methodology here? I mean, a lot of the bootstrapping methods made sense in cases where, say, you were facing time constraints, limitations on available data, or weren’t like Zuck with unlimited compute and 16k NVidia H100s just sitting around…

Can anyone explain the difference / why it is not discussed here? Thanks.

I am not familiar with what you are describing as “CV” and have not previously heard references to “bootstrap” models. So I believe that simply means that those topics are not covered in any of the courses I’ve yet taken here. The deepest treatment I have seen of managing training/cv/test data and the whole model training and evaluation process by Prof Ng is in DLS C2 and DLS C3.

In DLS C2, he describes the need for three different sets of data: training data, cross validation data and test data. We use the training data to train the model, then what he calls the cross validation data to tune the hyperparameters. Then, when we finally have a model that works well on both the training and cv data, we use the “test” dataset to get a fair evaluation of the performance of the model on data it has never “seen” before.
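In rough terms, that workflow looks something like this (a toy sketch with scikit-learn, not the actual course code; the split proportions and hyperparameter grid are just illustrative):

```python
# Sketch of a single train / validation ("dev") / test split.
# Proportions and model are illustrative, not from the course.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# 60% train, 20% validation ("dev"), 20% test
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

best_model, best_score = None, -1.0
for C in [0.01, 0.1, 1.0, 10.0]:  # hyperparameter tuning uses only the dev set
    model = LogisticRegression(C=C, max_iter=1000).fit(X_train, y_train)
    score = model.score(X_val, y_val)
    if score > best_score:
        best_model, best_score = model, score

# the test set is touched exactly once, at the very end
print("test accuracy:", best_model.score(X_test, y_test))
```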

He covers more sophisticated scenarios in which the various datasets can come from multiple statistical distributions in DLS C3.

Do you have a reference to any website that describes the other definition of “Cross Validation” you are using above? But I think the overall answer to your question is that it simply isn’t covered here. Prof Ng is a pioneering expert in the field, of course, so you can assume he’s familiar with pretty much any technique that has been developed over the years. But there’s a finite amount of material in any given course or specialization, so he gets to pick and choose what he thinks is most relevant for the various courses here. So either the technique has fallen out of favor, or perhaps it’s more sophisticated than anything we’ve seen here. I hope we’ll get a sense of which it is based on further investigation.

Just a quick response for the moment; I will get back in more detail in a bit.

These are two of the sections we used in our textbook. CV is first, and then bootstrapping comes right after.

https://rafalab.dfci.harvard.edu/dsbook-part-2/ml/resampling-methods.html#cross-validation

Thanks for the link. My evening is looking a little booked, so I will probably not have a chance to read the book until tomorrow.

Hey Paul,

So I consulted an outside source on this (not my former professor) and checked my reference text–

and an even briefer chapter is mentioned–

So my present conclusion is that there is a bit of ‘strife’ in the field between ‘what we know matters’ (i.e. hand-crafting models, designing input features) and ‘it doesn’t matter, all we need is a million pounds of data and the network will just figure it out’.

I am well aware Prof. Ng completely knows what he is doing, as well as Prof. Irizarry.

However, I can understand that if you use CV and then try to scale up (and tune) all the hyperparameters on a traditional NN task, this becomes incredibly complicated.
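Just to put numbers on ‘incredibly complicated’: combining even a small hyperparameter grid with k-fold CV multiplies the number of full trainings quickly. A toy sketch (scikit-learn, hypothetical grid, nothing from the courses):

```python
# Sketch of why CV + hyperparameter tuning blows up: a grid search with
# k-fold CV fits the model (number of combinations) x k times.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
grid = {"C": [0.01, 0.1, 1.0, 10.0], "penalty": ["l2"]}  # tiny, illustrative grid

search = GridSearchCV(LogisticRegression(max_iter=1000), grid, cv=5)
search.fit(X, y)

n_combos = len(search.cv_results_["params"])
print(f"{n_combos} combinations x 5 folds = {n_combos * 5} trainings for one tiny model")
```

With a deep network, each one of those trainings is hours or days, not milliseconds.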

For myself, I must admit, I think neural nets are ‘neat’… But I am not yet a believer in ‘let’s just throw all the data at it’ (i.e. diffusion/transformers).

We just have not thought cleverly enough yet. And for anyone else: I don’t think full CV is a bad idea; to me it makes sense.

Perhaps it is just not often explored/talked about.

Hi @Nevermnd, some time ago I came across k-fold cross validation, and apparently, although the method is more investigative and helpful in tuning the model, it also has a running computational cost, which becomes larger and larger as the dataset grows.

Especially for neural networks, where the dataset may be huge, this would be counterproductive and it would take a very long time to finish training. Also, don’t forget that you run training for many epochs, and each epoch the training and validation sets could be shuffled (this might have some negative implications), which is basically a kind of cross validation, but without taking as much time as full cross validation.
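A back-of-the-envelope sketch of that cost argument (the hours figure is completely made up, just to show the scaling):

```python
# Rough cost comparison: k-fold CV retrains the model k times on ~(k-1)/k
# of the data, so the total cost is roughly k times a single split.
hours_per_full_training = 12  # hypothetical: one full training run of a large NN
k = 5

single_split_cost = hours_per_full_training
kfold_cost = k * hours_per_full_training * (k - 1) / k  # each fold trains on (k-1)/k of the data

print(f"single split: ~{single_split_cost} h, {k}-fold CV: ~{kfold_cost:.0f} h")
```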

Ideally, the goal of an AI agent is to learn with as little human interference as possible, right?

Hello Anthony @Nevermnd!

I think our only goal is: to get a performance score that we can trust.

If I have a small dataset with an 80/20 split, I would actually wonder how representative (of unseen data) both the training and the validation sets are. This is why I would do CV: to make sure I did not get a good score from a lucky split. Since both sets are small, I would expect the validation scores across the different folds to vary a lot.

As I get a bigger set, I can expect to see less variation among the folds, and at the point where the variation is so small that it is of little interest, I will begin to wonder whether I still need to do CV at all.

Dataset size should be related to the variation of the CV scores. You can easily experiment with this on any dataset by progressively masking out a certain percentage of the data. The more you mask out, the higher the resulting CV variation should be.
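Here is a minimal sketch of that masking experiment (scikit-learn on a synthetic dataset; the model and fractions are just placeholders):

```python
# Sketch of the masking experiment: run k-fold CV on progressively smaller
# subsets of the same dataset and watch the spread of the fold scores grow.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
model = LogisticRegression(max_iter=1000)

for keep in [1.0, 0.5, 0.2, 0.05]:  # fraction of the data we keep
    n = int(keep * len(X))
    scores = cross_val_score(model, X[:n], y[:n], cv=5)
    print(f"keep {keep:>4.0%}: mean={scores.mean():.3f}  std={scores.std():.3f}")
```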

On the other hand, our network size usually scales with our dataset size, therefore, we will see the following relation:

NN size ↔ Data size ↔ CV variation ↔ Need to do CV?

Small data usually means a smaller NN and larger CV variation, and a larger variation requires CV for us to statistically tell how large the variation is. The larger the variation, the less confidence we have in any one score value.

I think the above is a good argument for deciding that with a large dataset we can expect very small variation, and thus k-fold CV won’t be very meaningful, whereas with a small dataset we need CV to tell us how confident we are in any one validation score we get.

Cheers,
Raymond

I think ‘sort of’.

Or, it has to be recalled that we are the ones who select what data to give the model, right? And this fact alone explains so much of the bias that comes out.

I mean, take a ‘famous case’, and who knows, I am not sure, but I don’t think (or hope) the Google engineers said to themselves ‘Oh, let’s represent black people as gorillas’ [in images]; rather, they just missed the balance in their training set, which might not have been obvious to them.

But to me, this is the bigger thing: the nets still only learn what we train them on.

I think you are speaking more towards ‘unsupervised’ learning, and I’m not sure we’ve gotten all that far at this point… But if you have an idea, drop me a note! :grin:

I meant learning by itself from the data, but yes, the data can be biased, and what data is not biased, for that matter…