Train_test_split -vs- KFold/StratifiedKFold

Hello everyone!

I was checking my notes for this lecture (C2_W3) and Dr. Andrew Ng didn’t mention KFold/StratifiedKFold. However, I’m curious what you guys think about which one works better in the real world: train_test_split or KFold/StratifiedKFold?

I’m eager to know your thoughts. Thanks in advance!

Hi @cajumago, welcome back to the community!

Prof. Andrew Ng chose to focus on a select number of machine learning topics and techniques in his Machine Learning Specialization due to constraints such as limited time, relevance to a broad range of applications, and complexity. By covering the most commonly used and relevant methods, Prof. Ng aimed to provide students with the tools needed to solve real-world problems within the course’s timeframe.

When it comes to evaluating a model on held-out data, there are a couple of methods that are commonly used. train_test_split is a simple and cheap way to randomly divide the data into a training set and a testing set, and it works well when the dataset is large enough that a single held-out set is representative. However, on smaller datasets it is less reliable, because the score can change noticeably depending on which samples happen to land in the test set.

On the other hand, KFold/StratifiedKFold give a more robust estimate, especially for small or medium-sized datasets or those with class imbalance. The data is split into k folds and the model is trained and evaluated k times, with each fold used as the testing set exactly once and the remaining k - 1 folds used for training. Averaging the k scores reduces the variance of the performance estimate, at the cost of training the model k times. StratifiedKFold is particularly useful for imbalanced classification datasets because it keeps the class proportions in each fold approximately the same as in the full dataset.

So, the choice between train_test_split and KFold/StratifiedKFold depends on the specific needs of the project. If you have a large dataset, a model that is expensive to train, or you’re just prototyping, then train_test_split is probably fine. But if you have a smaller dataset or need a more reliable estimate of the model’s performance, then KFold/StratifiedKFold is likely to be a better choice.
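
If it helps to see the difference in practice, here is a rough sketch using scikit-learn. The dataset (load_breast_cancer), the model (a scaled logistic regression), the seeds, and k = 5 are all my own choices for illustration and are not from the course. It repeats a single hold-out split with a few different random seeds, then runs stratified 5-fold cross-validation on the same data.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Example data and model, chosen only for illustration.
X, y = load_breast_cancer(return_X_y=True)
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# 1) Single hold-out split, repeated with different seeds to show how much
#    the score depends on which samples end up in the test set.
holdout_scores = []
for seed in range(5):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=seed
    )
    model.fit(X_train, y_train)
    holdout_scores.append(model.score(X_test, y_test))

# 2) Stratified 5-fold cross-validation on the same data.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Inspect the folds: count how often each sample is used for testing and
# record the share of class 1 in every test fold.
times_tested = np.zeros(len(y), dtype=int)
fold_positive_rates = []
for train_idx, test_idx in cv.split(X, y):
    times_tested[test_idx] += 1
    fold_positive_rates.append(y[test_idx].mean())

# cross_val_score fits a fresh copy of the model on each training part
# and returns one accuracy score per fold.
cv_scores = cross_val_score(model, X, y, cv=cv)

print("hold-out accuracy per seed:", np.round(holdout_scores, 3))
print("5-fold accuracy per fold:  ", np.round(cv_scores, 3))
print("5-fold mean and std:        %.3f, %.3f" % (cv_scores.mean(), cv_scores.std()))
print("share of class 1 overall:   %.3f" % y.mean())
print("share of class 1 per fold: ", np.round(fold_positive_rates, 3))
print("every sample tested once:  ", bool((times_tested == 1).all()))
```

The hold-out scores will usually differ a little from seed to seed, which is the "depends on the particular samples" issue. The cross-validation mean averages over five different test sets, and the per-fold class shares stay close to the overall share because of the stratification.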