Coverage of k-fold cross validation and other splitting strategies

Hi,

I have finished the MLS and I am in the middle of the DLS now.

In all of the courses I have seen from Andrew, the train/dev/test split seems to be “fixed” right from the start, i.e. it seems that Andrew recommends splitting the available data only once (randomly) at the beginning of the project and then sticking with this exact split until the end of the project.

However, I find it an appealing idea to “shuffle” this split repeatedly during the project. My intuition is that if I choose only one split randomly at the beginning and stay with it, I might be unlucky and “by chance” this split has an implicit bias in one of the generated data sets. For example, I might “by chance” have a significant number of mislabeled data only in the dev set, or some other bias that leads to different dev/test distributions. Even if the probability for this is small, why should I take the risk? (The risk increases when I have to deal with smaller datasets.)

I heard of an approach called “k-fold cross validation” which seems to propose a different strategy, namely to systematically re-shuffle the train/dev/test split. Is there a reason why Andrew does not talk about this possibility, or will this be covered later in the course?

Here is some description of this approach:
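Roughly, as I understand it, the data is split into k folds; each fold is used once as the validation set while the remaining k-1 folds are used for training, and the results are averaged. Here is a minimal sketch with scikit-learn’s KFold (the data and the model are just placeholders I made up for illustration, not from the course):

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

# Placeholder data, just for illustration
X, y = make_regression(n_samples=200, n_features=5, noise=0.1, random_state=0)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
fold_scores = []
for train_idx, val_idx in kf.split(X):
    model = LinearRegression()
    model.fit(X[train_idx], y[train_idx])
    # each fold serves as the validation set exactly once
    fold_scores.append(model.score(X[val_idx], y[val_idx]))

print("mean R^2 over 5 folds:", np.mean(fold_scores))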

To you as practitioners: Do you follow Andrew’s method and just do the split once in your projects? Or do you use different approaches (maybe even different from k-fold cross validation)?

Thanks for any hints. I hope this fits in the “General discussion” forum for now. Please feel free to move this post to a more appropriate place if there is one.

Best regards
Matthias

I think Prof Andrew is saying to do the split at the beginning because the projects he assumes one to be involved with have a lot of data, and k-fold cross validation is not practical if you have a lot of data! On the other hand, if your data is limited, then k-fold cross validation is a very nice technique to maximize accuracy and training performance.

I was going to ask the same question and found out it was already asked by @Matthias_Kleine.
I noticed this problem with my Titanic experimentation. When I re-run the code all over, I get a different result. I kind of cheated by stopping the re-runs at the best cross-validation result, but that was cheating myself =) @gent.spah’s answer seems convincing, since even my modest NN model for an extremely small dataset takes 10-20 seconds to train!
Regression models don’t take that much time.
So, can you help me understand this code?
Do I send all the training and test data to this code so that it gives me the validation scores for all the folds?

Here is the code:

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

model = LinearRegression()
scores = cross_val_score(model, X, y, cv=10)

So I don’t need to split my data into training and cross-validation sets, right? I just need to split off a test set rather than a validation set.

If this code is right, this is using scikit-learn, right? Then I believe the X and y sets are split by the function, with cv=10 meaning 10 folds. In each run, 1 fold is used for validation and 9 for training.

No need for you to split them manually into 10 folds.
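For example, something along these lines (just a sketch with made-up placeholder data, not your Titanic setup):

from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Placeholder data, just for illustration
X, y = make_regression(n_samples=500, n_features=8, noise=0.2, random_state=42)

# One manual split: keep a test set aside for the final evaluation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression()

# 10-fold cross validation on the training portion only:
# each fold is used once for validation and 9 times for training
scores = cross_val_score(model, X_train, y_train, cv=10)
print("scores per fold:", scores)
print("mean score:", scores.mean())

# Final check on the untouched test set
model.fit(X_train, y_train)
print("test score:", model.score(X_test, y_test))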


Thank you, @gent.spah!