Why do we split our data into training and testing/dev sets?

Why is the ratio larger for the training set?
Can someone please explain? I am new here and learning deep learning.

Let's assume you want to teach a kid what the letter 'A' looks like, so you show them many examples: "this is how you draw an 'A'" or "this is what an 'A' looks like." That means you need a large set to train on, i.e., you write 'A' in front of the kid many times (this is the large training split). Then, to evaluate whether the kid has actually learned it, you only need to show them 10 or 20 drawings of 'A' and see if they can recognize them.


Thanks bruh, you explained it with a good example.

Hi @Ahmad_Khalid1

When you’re training a machine learning or deep learning model, you need to ensure that your model is learning from a diverse range of examples and that it can generalize well to new, unseen data. This is where data splitting comes in.

The training set is the portion of your data that you use to actually train your model. It’s like a teacher showing examples to a student. The model learns from these examples by adjusting its internal parameters to fit the patterns and relationships in the data. A larger training set is often preferred because it provides more diverse examples, helping the model learn better.

The testing (or development) set is a separate portion of your data that you don’t use during training. It’s like a final exam for the student. After your model has learned from the training data, you use the testing set to evaluate how well it can generalize to new, unseen examples. This step is crucial to assess the model’s performance on data it hasn’t seen before. The testing set helps you gauge how well your model is likely to perform in real-world scenarios.
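To make this concrete, here is a minimal sketch of an 80/20 train/test split. It assumes scikit-learn and NumPy are installed and uses a toy dataset invented just for illustration:

```python
# Minimal sketch of an 80/20 train/test split (assumes scikit-learn and NumPy).
import numpy as np
from sklearn.model_selection import train_test_split

# Toy dataset: 100 examples, 3 features each, with binary labels.
X = np.arange(300).reshape(100, 3)
y = np.array([i % 2 for i in range(100)])

# Reserve 20% of the data for testing; the larger 80% is used for training.
# random_state makes the shuffle reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape)  # (80, 3)
print(X_test.shape)   # (20, 3)
```

The model only ever sees `X_train`/`y_train` during training; `X_test`/`y_test` is held back for the final evaluation.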

Why Split Data?

  1. Performance Evaluation: By evaluating the model on a separate testing set, you get an unbiased estimate of its performance. This gives you a sense of how well your model is doing and helps you make improvements if needed.
  2. Preventing Overfitting: If you evaluate your model on the same data you used for training, it might seem like it’s performing well, but it could simply be memorizing the examples. The separate testing set helps you catch overfitting, where the model doesn’t generalize well and instead memorizes the training data.
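Point 2 can be demonstrated with a quick sketch: an unconstrained decision tree is a convenient stand-in for a model that memorizes its training data, and comparing its training accuracy against its held-out test accuracy exposes the gap (this uses a synthetic scikit-learn dataset purely for illustration):

```python
# Sketch: compare accuracy on the training set vs. the held-out test set
# to spot overfitting (assumes scikit-learn is installed).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic classification data, just for illustration.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# A depth-unlimited tree can memorize the training examples almost perfectly.
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

train_acc = model.score(X_train, y_train)  # near-perfect on memorized data
test_acc = model.score(X_test, y_test)     # typically noticeably lower

print(f"train accuracy: {train_acc:.2f}, test accuracy: {test_acc:.2f}")
```

If we had evaluated only on the training data, the model would look flawless; the test set reveals how much of that is memorization rather than generalization.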

In summary, data splitting into training and testing/development sets is a foundational practice in machine learning and deep learning. It ensures that your model learns from diverse examples and can be evaluated on unseen data, helping you build models that perform well in real-world situations.

Best regards
