Can we reweight training data?

Say we have a training set (X_train, Y_train), and there is a subset of this data that is underrepresented, or that models consistently perform worse on.

Can we give that subset more weight when training the model? What pros and cons do we need to consider when using this approach?
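To make the question concrete, here is a minimal sketch of this idea using scikit-learn's `sample_weight` parameter, which most of its estimators accept in `fit`. The data, subset sizes, and 5x weight factor below are all made up for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Made-up data: 900 "majority" samples and 100 "minority" samples,
# where the minority subset is the one the model underperforms on.
X_major = rng.normal(loc=0.0, scale=1.0, size=(900, 2))
X_minor = rng.normal(loc=2.0, scale=1.0, size=(100, 2))
X_train = np.vstack([X_major, X_minor])
y_train = np.concatenate([np.zeros(900), np.ones(100)])

# Give every minority sample 5x the weight of a majority sample,
# so its loss terms count 5x as much during training.
sample_weight = np.concatenate([np.ones(900), 5.0 * np.ones(100)])

model = LogisticRegression()
model.fit(X_train, y_train, sample_weight=sample_weight)
```

For class-level (rather than sample-level) reweighting, many scikit-learn estimators also accept a `class_weight` argument, which achieves the same effect without building a per-sample array.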

You can, definitely. However, how will this affect the distribution of your training data versus the distribution of the data the model will receive in production? Any change in the distribution will affect your model's performance in production: it may give you great results in training, but the results may differ in production. So that's what I would check first.
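A quick back-of-the-envelope sketch of why this matters: upweighting changes the mix of data the loss effectively sees, even when the raw training set matches production. The fractions and the 5x factor below are made up:

```python
import numpy as np

# Suppose the underperforming subset is 10% of the training data,
# matching its 10% share in production (made-up numbers).
train_fractions = np.array([0.9, 0.1])  # [majority, minority]

# Upweighting the minority 5x changes the effective fractions
# that the training loss sees.
weights = np.array([1.0, 5.0])
effective = train_fractions * weights
effective /= effective.sum()

print(effective)  # the minority now contributes ~36% of the loss, not 10%
```

So even though the sampled data still matches production, the objective is optimized as if the minority subset were far more common than it will be at inference time.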

Thanks Juan! Yes, re-weighting the training set will make the distribution of the data different between training and production.

One follow up question:
Is it important that we always keep the distribution of training data and production data as similar as possible?

Hello @Haoting_Wang,

It is best if all of them (training set, cv set, test set) have the same distribution as the production data. Sometimes you have a very large training set (for example, for training a computer vision model); then you want to make sure that at least the cv set and the test set, which are usually much smaller and more manageable, are in the same distribution as the production data. The cv set is for model selection, and you want that to be accurate. The test set is for generalization error estimation, and you need that to be accurate too.
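One common way to keep the cv and test sets matched to the production mix of subgroups is a stratified split. A minimal sketch with scikit-learn's `train_test_split` and its `stratify` parameter; the data and the `group` labels marking the underperforming subset are made up:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)

# Made-up data: 1000 samples, with 10% belonging to the minority subset,
# matching its assumed 10% share in production.
X = rng.normal(size=(1000, 3))
group = np.concatenate([np.zeros(900), np.ones(100)])

# Stratify on the subgroup label so both splits keep ~10% minority.
X_cv, X_test, g_cv, g_test = train_test_split(
    X, group, test_size=0.5, random_state=0, stratify=group
)

print(g_cv.mean(), g_test.mean())  # both ~0.10
```

This way, even if the training set is reweighted, the cv and test sets still reflect the production distribution, so model selection and the generalization estimate stay honest.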

Cheers,
Raymond

Yes. It is important that the training data be as representative as possible of the data the model will see at inference. When there is a mismatch, you will probably see degraded performance.