Why is the training data size 50884 in C3W2 Practice Lab 2?


Why is the training data size 50884 instead of 25521?

The reason is explained in the text of section 3.1:

“Some ratings are repeated to boost the number of training examples of underrepresented genre’s.”

How can repeating the same ratings (training examples of underrepresented genres) help?
For example, if “documentary” has only 1 rating, would repeating it 100 times really help? It’s still the same data.

Repeating underrepresented samples doesn’t provide new information, but it shifts the balance of the training set towards those samples, in the hope that the model trained on it reflects that balance. That’s the idea.
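
If it helps to see the idea concretely, here is a toy sketch of random oversampling with pandas. The genre names, counts, and column names are made up for illustration; this is not the lab’s actual data or procedure, just the general technique:

```python
import pandas as pd

# Hypothetical ratings table -- "documentary" is heavily underrepresented.
ratings = pd.DataFrame({
    "genre":  ["action"] * 8 + ["documentary"] * 2,
    "rating": [4, 5, 3, 4, 5, 2, 4, 3, 5, 4],
})

# Random oversampling: repeat rows of minority genres (with replacement)
# until every genre appears as often as the largest one.
target_count = ratings["genre"].value_counts().max()
balanced = pd.concat(
    [
        group if len(group) == target_count
        else group.sample(n=target_count, replace=True, random_state=0)
        for _, group in ratings.groupby("genre")
    ],
    ignore_index=True,
)

print(ratings["genre"].value_counts())   # action 8, documentary 2
print(balanced["genre"].value_counts())  # action 8, documentary 8
```

The duplicated rows carry no new information, but during training they are seen more often, so the loss weights the minority genre more heavily than it would otherwise.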

For more, you might want to look into topics like “imbalanced data” and “data augmentation”, with a focus on how we can deal with imbalanced data in order to achieve more balanced performance over the whole data space of concern.

Cheers,
Raymond

Will do so! Thank you!