The consequences of different distributions in train, dev, and test sets

Hey @Sara,

I love your questions :slight_smile:
These two are easy to explain, though.

We train our model on the training set and evaluate it on the dev and test sets. In a sense, the purpose of the test set is to confirm that our evaluation on the dev set is correct (we expect the dev and test errors to be close).

The purpose of the dev and test sets is to perform evaluation on them. If they had different distributions, we wouldn't be able to compare the errors and reason about the results.
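As a side note, here is a minimal sketch of how this usually looks in practice (the toy dataset and split ratios are made up for illustration): all three sets are carved out of one shuffled sample, so they share a distribution by construction. This uses scikit-learn's `train_test_split`, applied twice:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy dataset standing in for our empirical sample (hypothetical shapes).
X = np.random.randn(10_000, 20)
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

# First cut off a 10% test set, then split the remainder into train and dev.
# Because all three come from one shuffled sample, they share a distribution.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.1, random_state=0)
X_train, X_dev, y_train, y_dev = train_test_split(X_rest, y_rest, test_size=0.1, random_state=0)
```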

The true data distribution is unknown. We only have access to an empirical data distribution: our training set.

  • If we evaluate our optimization of the model's parameters on data drawn from the same distribution as our training set, we are doing the best we can (we are aiming as close to the true target as possible).
  • If we evaluate our optimization on data drawn from a different distribution, we may still get decent results, but they will be worse than with the previous option (we are aiming further from the true target); see the sketch after this list.
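To make that second point concrete, here is a small self-contained sketch with synthetic data (the feature setup and numbers are invented for illustration): the label truly depends on a hidden variable `z`, one feature is only spuriously correlated with `z`, and that correlation flips under the shifted distribution.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def sample(n, spurious_sign=+1):
    # The label depends on z; x1 is only *correlated* with z, and that
    # correlation flips sign under the shifted distribution.
    z = rng.normal(size=n)
    x0 = z + 0.5 * rng.normal(size=n)
    x1 = spurious_sign * z + 0.5 * rng.normal(size=n)
    X = np.column_stack([x0, x1])
    y = (z > 0).astype(int)
    return X, y

X_train, y_train = sample(5_000)                          # training distribution
X_dev_same, y_dev_same = sample(1_000)                    # dev set, same distribution
X_dev_shift, y_dev_shift = sample(1_000, spurious_sign=-1)  # dev set, shifted distribution

model = LogisticRegression().fit(X_train, y_train)

print("dev error, same distribution:   ", 1 - model.score(X_dev_same, y_dev_same))
print("dev error, shifted distribution:", 1 - model.score(X_dev_shift, y_dev_shift))
```

Because the model leans on `x1`, whose relationship to the label flips under the shift, the second error should come out noticeably higher: that is the "aiming further from the true target" effect, and it is why dev and test errors are only comparable when both sets come from the same distribution.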