I have a question regarding a topic covered in the Week 1 quiz of Course 3, where it is mentioned that the training set may have a different distribution than the dev and test sets.

I am having difficulty understanding why a different distribution between the training and dev/test sets does not impact the results. Additionally, I’m not entirely clear on the concept of distribution. Since I haven’t yet encountered a real-world case, my understanding of the train/dev/test set is still quite abstract.

I would greatly appreciate it if someone could provide clarification on this matter, share relevant sources, or offer any insights that might help solidify my understanding.

I will try to break your post into points to make it clear for you.

Let’s start by defining the different types of sets and their purposes:

Training Set:

This set is used to train the machine learning model. It consists of input-output pairs that the model learns from.

Development (Dev) Set / Validation Set:

This set is used during the model development phase. It helps tune hyperparameters and evaluate different models to select the best-performing one.

Test Set:

This set is kept separate and is not used during the training or model development. It’s used to assess the final performance of the chosen model.
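To make the three roles concrete, here is a minimal sketch of how a dataset might be split (the 80/10/10 ratio is just an illustrative choice, not something fixed by the course):

```python
import random

def train_dev_test_split(examples, dev_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle the examples, then slice them into three disjoint sets."""
    examples = list(examples)
    random.Random(seed).shuffle(examples)
    n = len(examples)
    n_test = int(n * test_frac)
    n_dev = int(n * dev_frac)
    test = examples[:n_test]               # held out until the very end
    dev = examples[n_test:n_test + n_dev]  # used to tune hyperparameters
    train = examples[n_test + n_dev:]      # used to fit the model
    return train, dev, test

train, dev, test = train_dev_test_split(range(100))
print(len(train), len(dev), len(test))  # 80 10 10
```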

Now that we have defined the different types of sets, let’s address another point: “distribution”.

The term “distribution” refers to the statistical pattern or spread of the data: which kinds of examples appear and how often. For instance, if you’re training a model to recognize cats and dogs, the distribution describes which kinds of cat and dog images occur and in what proportions.

When the training set has a different distribution than the dev/test sets, it can pose challenges because the model may not generalize well to unseen data. The model might be too specialized and perform poorly on data with a different distribution.

Why might it not impact the results?

Transfer Learning:

Techniques like transfer learning involve pre-training a model on one task or dataset and fine-tuning it on another. This can help mitigate distribution differences.
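As a toy illustration of the idea (not the course’s actual code), one can pre-train a feature extractor on one dataset, freeze it, and fit only a new output layer on the target data; here the “pre-trained” weights are simply a fixed random matrix standing in for weights learned on a large source dataset:

```python
import numpy as np

rng = np.random.default_rng(0)

# "Pre-trained" feature extractor: in practice these weights would come
# from training on a large source dataset; here they are just fixed.
W_frozen = rng.normal(size=(5, 8))

def features(X):
    return np.tanh(X @ W_frozen)  # frozen hidden layer

# Fine-tuning: fit only the new output layer on the (small) target data
# with ordinary least squares, leaving W_frozen untouched.
X_target = rng.normal(size=(40, 5))
y_target = rng.normal(size=40)
H = features(X_target)
w_out, *_ = np.linalg.lstsq(H, y_target, rcond=None)

predictions = features(X_target) @ w_out
print(predictions.shape)  # (40,)
```

Only `w_out` is learned on the target data; the frozen layer carries over whatever structure it captured from the source task.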

Robust Models:

Some models are inherently robust to variations in the data distribution, especially if they are designed to generalize well.

Data Augmentation:

Techniques like data augmentation, which involve creating new training examples through transformations, can make the model more robust to distribution differences.
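For image data, a minimal sketch of such an augmentation (here a horizontal flip, with images represented as plain arrays) might look like:

```python
import numpy as np

def augment_with_flips(images):
    """Return the original images plus their horizontally flipped copies.

    `images` is an array of shape (n, height, width); flipping along the
    width axis doubles the training set without collecting new data.
    """
    flipped = images[:, :, ::-1]
    return np.concatenate([images, flipped], axis=0)

batch = np.arange(2 * 3 * 4).reshape(2, 3, 4)
augmented = augment_with_flips(batch)
print(augmented.shape)  # (4, 3, 4)
```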

So in practice, it’s ideal for the training, dev, and test sets to come from the same distribution to ensure the model generalizes well. However, in certain situations, the impact of distribution differences can be mitigated through the strategies mentioned above.

If you encounter a real-world case where the distributions are significantly different, it’s essential to carefully consider the potential impact on model performance and explore techniques to address the issue.

I hope it’s clearer for you now; feel free to ask for more clarifications anytime.
Regards,
Jamal

Hello @Thrasso00, a probability distribution is formally defined by:

A sample space A: the set of possible outcomes.

A set \mathcal{A} of subsets of A (the events) such that:

A \in \mathcal{A}: the entire set A is contained in \mathcal{A}.

If B \in \mathcal{A} then A \setminus B \in \mathcal{A}: closed under complementation.

If B_0, B_1, B_2, ... \in \mathcal{A} then B_0 \cup B_1 \cup B_2 \cup ... \in \mathcal{A}: closed under countable union.

A function P : \mathcal{A} \to \Bbb R such that:

P(B) \ge 0 for all B \in \mathcal{A}.

P(A) = 1, which implies P(B) \le 1 for all B \in \mathcal{A}.

P(\cup_i B_i) = \sum_i P(B_i) for any countable family of pairwise disjoint sets \{B_i\} in \mathcal{A}.

A function P that satisfies the conditions above is called a probability measure.

The collection \mathcal{A} in 1. and 2. is called a sigma-algebra, and 1., 2., and 3. together are called a probability space.

A distribution in this context is the function P in the probability space (A, \mathcal{A}, P).

Concretely let:

A be the set of faces of a fair coin {H, T} according to 1.

Resulting in \mathcal{A} = {{}, {H}, {T}, {H, T}} according to 2.

With P (the probability distribution) given by P(\{\}) = 0, P(\{H\}) = 1/2, P(\{T\}) = 1/2 and P(\{H, T\}) = P(\{H\}) + P(\{T\}) = 1/2 + 1/2 = 1 according to 3.

The example above is a particular case of a Bernoulli distribution.
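The coin example is small enough that the measure axioms can be checked exhaustively; a short sketch (the events and probabilities are taken directly from the example above):

```python
from itertools import chain, combinations

A = {"H", "T"}
# The sigma-algebra: all subsets of A, as frozensets so they can be dict keys.
sigma_algebra = [frozenset(s) for s in chain.from_iterable(
    combinations(sorted(A), r) for r in range(len(A) + 1))]
# Fair coin: each outcome equally likely, so P(B) = |B| / |A|.
P = {B: len(B) / len(A) for B in sigma_algebra}

assert P[frozenset()] == 0 and P[frozenset(A)] == 1             # normalization
assert all(0 <= P[B] <= 1 for B in sigma_algebra)               # non-negativity
assert P[frozenset(A)] == P[frozenset({"H"})] + P[frozenset({"T"})]  # additivity
print("all axioms hold")
```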

I hope this helps make clear what a distribution is in this context.

I apologize, but I’m a bit uncertain about this. I believe the context is more related to Computer Vision Projects. My confusion stems from a specific question in the quiz:

I’m curious about why the origin of the picture plays a role in the distribution. In this hypothetical scenario, how could we induce a change in the distribution?

For instance, if we’re capturing images of the sky, does a different distribution imply a higher proportion of pictures with birds, or does it refer to, for example, an increase in blurry images, or perhaps something else?

I’d appreciate any clarification on this matter. Thank you!

Following your example, if you take a picture of the sky in New York you could have, say, a 1/3 chance of capturing a bird and a 2/3 chance of not capturing one (call that distribution 1).

On the other hand, if you are in the Amazon rainforest and take a picture of the sky, you could have a 9/10 chance of capturing a bird and a 1/10 chance of not capturing one (call that distribution 2).

Then distribution 1 is different from distribution 2. If you trained your model only with pictures from distribution 1 it will be biased and will not generalize well in a setting like the one with distribution 2.

However, you are not hopeless, because you can augment your data by taking pictures of the sky in, say, Shenandoah National Park, which will have a distribution of birds in the sky closer to distribution 2.

In that manner the impact of the bias in your original training set is mitigated by the addition of pictures from another set in different circumstances.

A similar logic can be applied to the other two solutions proposed.

The connection with all the “formalism” in my previous post can readily be made. Distribution 1 is Bernoulli with probability 1/3, distribution 2 is also Bernoulli but with probability 9/10, and the distribution after adding the new images would also be Bernoulli, with some probability p > 1/3. All the conditions of the “formalism” are satisfied.
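A quick simulation of the two Bernoulli distributions (1/3 for New York, 9/10 for the bird-rich skies, as in the example) shows how mixing in extra data shifts the empirical training distribution toward the target:

```python
import random

rng = random.Random(42)

def sample_bird_labels(p_bird, n):
    """Draw n Bernoulli(p_bird) labels: 1 = picture contains a bird."""
    return [1 if rng.random() < p_bird else 0 for _ in range(n)]

ny_photos = sample_bird_labels(1 / 3, 10_000)      # distribution 1
extra_photos = sample_bird_labels(9 / 10, 10_000)  # augmentation: bird-rich skies

frac_ny = sum(ny_photos) / len(ny_photos)
mixed = ny_photos + extra_photos
frac_mixed = sum(mixed) / len(mixed)
print(f"bird fraction: NY only {frac_ny:.2f}, after mixing {frac_mixed:.2f}")
```

The mixed set's bird fraction lands between 1/3 and 9/10, so a model trained on it sees far more bird examples than one trained on the New York photos alone.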