Building a data set - Validating on Replicates

Hi network,

I have a question regarding building a data set for a machine learning project. If there are duplicate sample responses (i.e., the same measurement performed twice on the same sample), how can we utilize this data effectively?

Training Set: include the duplicate sample responses to help the model become well-tuned and robust within the distribution of the training data.
Cross-Validation Set: include them in the cross-validation folds to gain additional insight into how consistently the model performs across different folds of the data.
Test Set: train the model on one replicate and then evaluate its performance on the second replicate.
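For illustration, the third idea (train on one replicate, evaluate on the other) could be sketched as below. This is only a sketch: the array names, the `sample_id`/`replicate` layout, and the assumption that every sample was measured exactly twice are all hypothetical, not from the course.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data set: 50 distinct samples, each measured twice (replicates).
n_samples = 50
sample_id = np.repeat(np.arange(n_samples), 2)  # [0, 0, 1, 1, 2, 2, ...]
replicate = np.tile([0, 1], n_samples)          # [0, 1, 0, 1, 0, 1, ...]
X = rng.normal(size=(2 * n_samples, 3))         # toy features
y = rng.normal(size=2 * n_samples)              # toy targets

# Train on the first replicate of every sample, test on the second.
train_mask = replicate == 0
test_mask = replicate == 1

X_train, y_train = X[train_mask], y[train_mask]
X_test, y_test = X[test_mask], y[test_mask]

# Each sample appears in both sets, once per replicate.
print(len(X_train), len(X_test))  # 50 50
```

Note that because both sets contain measurements of the same physical samples, this mostly tests measurement repeatability rather than generalization to unseen samples.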

Thanks a lot for any feedback.
I'm really enjoying the course.


The split between Training, CV, and Test should be random. Initially, you should not hand-pick how the data is distributed.

There may be cases where you would fine-tune the CV and Test sets, but in general it’s not a great idea.
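A random three-way split like the one described above might be sketched as follows, assuming scikit-learn is available; the 60/20/20 proportions and variable names are just illustrative choices, not prescribed by the course.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 rows, 5 features (values are arbitrary, for illustration only).
X = np.arange(500).reshape(100, 5)
y = np.arange(100)

# First split off a random 20% test set.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42
)

# Then split the remaining 80% into training (60% overall)
# and cross-validation (20% overall): 0.25 * 0.80 = 0.20.
X_train, X_cv, y_train, y_cv = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42
)

print(len(X_train), len(X_cv), len(X_test))  # 60 20 20
```

Fixing `random_state` keeps the split reproducible while the assignment of rows to each set remains random rather than hand-picked.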