Train_test_split -vs- KFold/StratifiedKFold

cajumago · April 20, 2023, 2:29am

Hello everyone!

I was checking my notes on this lecture (C2_W3) and Dr. Andrew Ng didn’t mention KFold/StratifiedKFold. However, I’m curious what you guys think about which one works better in the real world: train_test_split or KFold/StratifiedKFold?

I’m eager to know your thoughts. Thanks in advance!

Mujassim_Jamal · April 20, 2023, 4:38am

Hi @cajumago, welcome back to the community!

Prof. Andrew Ng chose to focus on a select number of machine learning topics and techniques in his Machine Learning Specialization due to constraints such as limited time, relevance to a broad range of applications, and complexity. By covering the most commonly used and relevant methods, Prof. Ng aimed to provide students with the tools needed to solve real-world problems within the course’s timeframe.

When it comes to splitting data into training and testing sets in machine learning, two approaches are commonly used. train_test_split is a simple and efficient way to randomly divide the data once into a training set and a testing set. It works well when the dataset is large enough that a single held-out set is representative, and for quick prototyping. However, it’s not always the most reliable method, because the score you get can depend on which particular samples happen to land in the test set, and this effect is strongest on small datasets.

On the other hand, KFold/StratifiedKFold give a more robust estimate, especially for small datasets or those with class imbalance. The data is split into k folds, and the model is trained and evaluated k times, using each fold as the testing set exactly once and the remaining k-1 folds for training. Averaging the k scores reduces the variance of the performance estimate, at the cost of training the model k times. StratifiedKFold is particularly useful for imbalanced classification datasets because it keeps the class proportions in each fold approximately the same as in the full dataset.
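To make the fold mechanics concrete, here is a minimal sketch using scikit-learn. The toy labels (90 samples of class 0, 10 of class 1), the placeholder features, and the helper name `minority_counts` are my own assumptions for illustration, not anything from the course. It counts how many minority-class samples end up in each test fold under KFold versus StratifiedKFold, and checks that every sample is used for testing exactly once.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Toy labels (assumed for illustration): 90 samples of class 0, 10 of class 1.
y = np.array([0] * 90 + [1] * 10)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder features, one column

def minority_counts(splitter):
    """Return the number of class-1 samples in each test fold (hypothetical helper)."""
    return [int(y[test_idx].sum()) for _, test_idx in splitter.split(X, y)]

# Plain KFold ignores the labels, so minority counts per fold can be uneven.
plain = minority_counts(KFold(n_splits=5, shuffle=True, random_state=0))

# StratifiedKFold uses the labels to keep class proportions similar in each fold.
stratified = minority_counts(StratifiedKFold(n_splits=5, shuffle=True, random_state=0))

print("KFold minority count per fold:          ", plain)
print("StratifiedKFold minority count per fold:", stratified)

# Across the k test folds, every sample index appears exactly once.
seen = np.concatenate([test_idx for _, test_idx in KFold(n_splits=5).split(X)])
print("each sample tested once:", sorted(seen.tolist()) == list(range(len(y))))
```

With 10 minority samples and 5 folds, StratifiedKFold places 2 in every test fold, while plain KFold makes no such guarantee.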

So, the choice between train_test_split and KFold/StratifiedKFold depends on the specific needs of the project. If you have a large dataset or you’re just prototyping, then train_test_split is probably fine. But if you have a small dataset, imbalanced classes, or need a more reliable estimate of the model’s performance, then KFold/StratifiedKFold is likely to be a better choice.
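If it helps, here is a rough sketch of how the two options look side by side in scikit-learn. The synthetic dataset from `make_classification`, the choice of `LogisticRegression`, and all parameter values are assumptions picked only to keep the example short; swap in your own data and model.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split

# Synthetic, imbalanced binary dataset (roughly 90% / 10%); purely illustrative.
X, y = make_classification(
    n_samples=500, n_features=10, weights=[0.9, 0.1], random_state=0
)

model = LogisticRegression(max_iter=1000)

# Option 1: a single hold-out split gives one score, which depends on
# which rows happen to land in the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
holdout_score = model.fit(X_train, y_train).score(X_test, y_test)

# Option 2: stratified 5-fold cross-validation gives five scores,
# and each row is used for testing exactly once.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
cv_scores = cross_val_score(model, X, y, cv=cv)

print(f"hold-out accuracy: {holdout_score:.3f}")
print(f"5-fold accuracy:   {cv_scores.mean():.3f} +/- {cv_scores.std():.3f}")
```

The spread of the five fold scores gives a feel for how much a single hold-out score could move with a different random split. Note that on imbalanced data, accuracy alone can be misleading, so a different `scoring` argument to `cross_val_score` may be more informative.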

Regards,
Mujassim
