Building a data set - Validating on Replicates

Julie_Frost_Dahl · July 21, 2023, 6:08am

Hi network

I have a question regarding building a data set for a machine learning project. If there are duplicate sample responses (i.e., the same measurement performed twice on the same sample), how can we utilize this data effectively?

Training Set: include the duplicate sample responses in the training set to help the model become well-tuned and robust within the distribution of the training data.
Cross-Validation Set: use the duplicates here to get additional insight into how well the model performs across different folds of the data.
Test Set: train the model on one replicate and then evaluate its performance on the second replicate.

Thanks a lot for any feedback.
I am enjoying the course a lot.

Best
Julie

TMosh · July 21, 2023, 6:33am

The split between Training, CV, and Test should be random. Initially, you should not hand-pick how the data is distributed.

There may be cases where you would fine-tune the CV and Test sets, but in general it’s not a great idea.
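
For illustration, a random split could be sketched roughly like this. This is only a minimal sketch: the function name `random_split`, the 60/20/20 fractions, and the fixed seed are assumptions made for the example, not something prescribed by the course.

```python
import random


def random_split(n_samples, train_frac=0.6, cv_frac=0.2, seed=0):
    """Return shuffled index lists for the training, CV, and test sets.

    The 60/20/20 default fractions are an assumption for this sketch.
    """
    indices = list(range(n_samples))
    # Shuffle so that no sample is hand-picked for a particular set.
    random.Random(seed).shuffle(indices)

    n_train = round(train_frac * n_samples)
    n_cv = round(cv_frac * n_samples)

    train_idx = indices[:n_train]
    cv_idx = indices[n_train:n_train + n_cv]
    test_idx = indices[n_train + n_cv:]  # whatever remains goes to the test set
    return train_idx, cv_idx, test_idx


# Example: split 100 measurements (replicates included) at random.
train_idx, cv_idx, test_idx = random_split(100)
print(len(train_idx), len(cv_idx), len(test_idx))
```

The index lists can then be used to select rows from the feature and label arrays, so each measurement ends up in exactly one of the three sets.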
