Concept not so clear on dataset shift, covariate shift and concept shift

Feihong_YANG · April 30, 2023, 6:39am

The introductory course introduced the concepts of data drift and concept drift, and I think I understood those definitions. But in course 2 the definitions are quite confusing to me. In the lecture “Detecting Data Issues”, the mentor divides the concept into 3 categories: dataset shift, covariate shift and concept shift. I can map concept shift to the original definition of concept drift, and covariate shift to data drift. But differentiating between dataset shift and covariate shift is quite challenging for me. According to the definitions, dataset shift refers to

Ptr(y,x) != Pserv(y,x), and covariate shift refers to Ptr(y|x) = Pserv(y|x) but Ptr(x) != Pserv(x).

I want to know: if the issue we found relates to covariate shift,
then Ptr(x) != Pserv(x),
and if that holds, wouldn’t it affect P(x,y), given P(x,y) = P(x)*P(y|x)?
If it does, wouldn’t it also satisfy the condition for dataset shift?
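
To make the question concrete, here is the reasoning written out. This is only a sketch based on the chain rule of probability; the subscripts tr and serv are the training and serving distributions from the lecture notation.

```latex
\begin{aligned}
P_{tr}(x, y)   &= P_{tr}(x)\, P_{tr}(y \mid x) \\
P_{serv}(x, y) &= P_{serv}(x)\, P_{serv}(y \mid x) \\[4pt]
\text{If } P_{tr}(y \mid x) = P_{serv}(y \mid x) &\text{ and } P_{tr}(x) \neq P_{serv}(x),
\text{ then } P_{tr}(x, y) \neq P_{serv}(x, y).
\end{aligned}
```

So, read literally, covariate shift also satisfies the inequality given for dataset shift, which is where my confusion comes from.
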
BTW, I also did some searching on the internet and found different definitions. For example, in this article the author treats dataset shift as a superclass of the other 2 shifts, which confused me even more.

Another point: the mentor introduced “average messages sent per minute” as an example of data drift. In the example, the recorded average number of messages sent per minute increased as the days went by. Yes, the statistics of the data did change, since the mean changed. But the mapping P(y|x) also changed. Is it because P(x) has already changed that we don’t classify it as a concept change?

canxkoz · April 30, 2023, 8:01am

Dear @Feihong_YANG ,
Thank you for asking this question. I will do my best to answer your question. If you have further questions please do not hesitate to ask.

Below are the definitions of some of the terms that you may feel confused about.

Dataset shift refers to a difference between the distribution of the data a model was trained on and the distribution of the data it sees later. This can happen for a variety of reasons, such as changes in the population the data comes from, or changes in the way the data is collected.
Covariate shift refers to a change in the distribution of the features, P(x), while the relationship between the features and the target variable, P(y|x), stays the same. This can happen, for example, if the features are collected in different ways over time.
Concept shift refers to a change in the relationship between the features and the target variable, P(y|x). This can happen, for example, if the meaning of the target variable changes over time, or if that relationship becomes more complex.

Now that we are both on the same page in terms of the definitions, I would like to reply to your question about “Ptr(y,x) != Pserv(y,X) and Covariate refer to Ptr(y|x) = Pserv(y|x) but Ptr(X) != Pserv(x)” relating to covariate shift.

Yes, the issue you found could be related to covariate shift. If the distribution of the features in the training data is different from the distribution of the features in the test data, then the model will not be able to generalize well to the test data. This is because the model will have learned to predict the target variable based on the distribution of the features in the training data, but the test data will have a different distribution.
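
As a rough illustration (a toy example of my own, not something from the course), here is a minimal sketch of that effect. It assumes a synthetic one-dimensional problem where the labeling rule, and therefore P(y|x), is identical for training and serving, and only the range that x is drawn from changes. The function name `simulate_covariate_shift` and all of the numbers are made up for the illustration.

```python
import numpy as np


def simulate_covariate_shift(seed=0, n=2000):
    """Fit a line on one range of x, then evaluate it on a shifted range."""
    rng = np.random.default_rng(seed)

    def label(x):
        # The same labeling rule everywhere, so P(y|x) does not change.
        return np.sin(x) + rng.normal(0.0, 0.05, size=x.shape)

    # Only P(x) differs between training and serving.
    x_train = rng.uniform(0.0, 1.5, size=n)
    x_serve = rng.uniform(1.5, 3.0, size=n)
    y_train = label(x_train)
    y_serve = label(x_serve)

    # np.polyfit returns coefficients from highest degree to lowest.
    slope, intercept = np.polyfit(x_train, y_train, deg=1)

    def predict(x):
        return slope * x + intercept

    mse_train = float(np.mean((predict(x_train) - y_train) ** 2))
    mse_serve = float(np.mean((predict(x_serve) - y_serve) ** 2))
    return mse_train, mse_serve


if __name__ == "__main__":
    mse_train, mse_serve = simulate_covariate_shift()
    print(f"MSE on training range: {mse_train:.4f}")
    print(f"MSE on serving range:  {mse_serve:.4f}")
```

The straight line is a reasonable approximation where the training data lives, but it extrapolates badly to the region the serving data comes from, even though the true relationship never changed.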

In your example, the change in the average number of messages sent per minute could be due to a change in the way that data is being collected. If the data is being collected from a different population, or if the data is being collected in a different way, then the distribution of the features will change. This could lead to covariate shift, which will impact the performance of the model.

If the distribution of the features and the target variable both change, then this is called dataset shift. Dataset shift is more severe than covariate shift, as it can cause the model to learn the wrong relationship between the features and the target variable.

About your second question on " is that because there is a change on P(x) already so we don’t classify it a concept change?"

Yes, you are correct. When the distribution of the features changes, the relationship between the features and the target variable may change as well. A change in that relationship is what is called concept shift.

In your example, the average number of messages sent per minute increased with the day. This could be due to a number of factors, such as people being more active on social media during certain times of the day. If the model is not updated to reflect this change, then it will not be able to predict the number of messages sent per minute accurately.

Concept shift is a more challenging problem to deal with than covariate shift. This is because the model needs to be able to learn the relationship between the features and the target variable on an ongoing basis. There are a number of techniques that can be used to deal with concept shift, such as online learning and ensemble methods.
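
To give a feel for the online learning idea, here is a minimal sketch, assuming a one-parameter linear model updated by stochastic gradient descent on a synthetic stream whose true slope flips halfway through. The names `make_stream` and `online_sgd`, the learning rate, and the slopes are all hypothetical choices for the illustration, not a recommended setup.

```python
import numpy as np


def make_stream(seed=0, n=500):
    """Build a stream of (x, y) pairs whose true slope changes from 2.0 to -1.0."""
    rng = np.random.default_rng(seed)
    xs = rng.uniform(-1.0, 1.0, size=2 * n)
    slopes = np.concatenate([np.full(n, 2.0), np.full(n, -1.0)])
    ys = slopes * xs + rng.normal(0.0, 0.01, size=2 * n)
    return list(zip(xs, ys))


def online_sgd(stream, lr=0.1):
    """Update the weight after every example instead of training once."""
    w = 0.0
    errors = []
    for x, y in stream:
        pred = w * x
        errors.append((pred - y) ** 2)
        # Gradient of the squared error (pred - y)^2 with respect to w.
        w -= lr * 2.0 * (pred - y) * x
    return w, errors


if __name__ == "__main__":
    w, errors = online_sgd(make_stream())
    print(f"final weight: {w:.3f}")
    print(f"mean error just after the change: {np.mean(errors[500:600]):.4f}")
    print(f"mean error at the end:            {np.mean(errors[-100:]):.4f}")
```

Because the weight keeps being updated, the error jumps when the relationship changes and then falls again as the model follows the new slope. A model trained once on the first half would stay wrong.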

I hope this helps to clarify the difference between concept shift and covariate shift.

Please feel free to ask a followup question if you have any.
Regards,
Can

Feihong_YANG · April 30, 2023, 12:26pm

Hey @canxkoz, I really appreciate your detailed answer. Here are some follow-ups:

If the distribution of the features and the target variable both change, then this is called dataset shift. Dataset shift is more severe than covariate shift, as it can cause the model to learn the wrong relationship between the features and the target variable.

Can I summarize it as follows?

If P(x) differs between the training and test datasets but P(y|x) stays the same, then we call it covariate shift.
If P(y|x) differs between the training and test datasets but P(x) stays the same, then we call it concept shift.
If both P(x) and P(y|x) change, then we call it dataset shift.
Concept drift and concept shift refer to the same issue, and data drift and covariate shift refer to the same concept. Dataset shift corresponds to neither concept drift nor data drift, since both P(x) and P(y|x) change between the training set and the test set, so it should be classified independently.

For the message sending example, I’m still a little confused. If all my summary items are correct, then this should be classified as dataset shift, meaning both P(x) and P(y|x) changed (since the decision boundary for classifying spam or ham changed). But why did the mentor introduce it as data drift (or covariate shift, if I’m correct)? I did see that P(y|x) also changed.

Thanks,
Feihong

canxkoz · May 1, 2023, 11:08am

Yes, your summary is correct.

Covariate shift occurs when the distribution of the features X changes, but the conditional distribution of the target variable Y given X, P(Y∣X), stays the same.
Concept drift occurs when the conditional distribution of the target variable Y given X, P(Y∣X), changes.
Dataset shift occurs when both the distribution of the features X and the conditional distribution of the target variable Y given X, P(Y∣X), change.
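
The three cases can be sketched as a small check on discrete data. This is only an illustration under assumptions of my own: the feature is a made-up "low"/"high" message rate, the counts are invented, and empirical frequencies are compared against a fixed tolerance `tol`, where a real pipeline would need a proper statistical test. The labels returned follow the definitions used in this post.

```python
from collections import Counter, defaultdict


def marginal(pairs):
    """Empirical P(x) from a list of (x, y) pairs."""
    counts = Counter(x for x, _ in pairs)
    total = sum(counts.values())
    return {x: c / total for x, c in counts.items()}


def conditional(pairs):
    """Empirical P(y|x) as {x: {y: probability}}."""
    by_x = defaultdict(Counter)
    for x, y in pairs:
        by_x[x][y] += 1
    return {
        x: {y: c / sum(ys.values()) for y, c in ys.items()}
        for x, ys in by_x.items()
    }


def max_gap(p, q):
    """Largest absolute difference between two discrete distributions."""
    return max(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in set(p) | set(q))


def describe_shift(train, serve, tol=0.05):
    """Name the shift using the definitions from this post."""
    px_changed = max_gap(marginal(train), marginal(serve)) > tol
    cond_train, cond_serve = conditional(train), conditional(serve)
    shared = set(cond_train) & set(cond_serve)
    pyx_changed = any(max_gap(cond_train[x], cond_serve[x]) > tol for x in shared)
    if px_changed and pyx_changed:
        return "dataset shift"
    if px_changed:
        return "covariate shift"
    if pyx_changed:
        return "concept drift"
    return "no shift detected"


def build(counts):
    """Expand {(x, y): n} into a list of (x, y) pairs."""
    return [pair for pair, n in counts.items() for _ in range(n)]


# Invented counts: x is the message rate, y is the label.
train = build({("low", "ham"): 72, ("low", "spam"): 8, ("high", "ham"): 4, ("high", "spam"): 16})
serve_cov = build({("low", "ham"): 45, ("low", "spam"): 5, ("high", "ham"): 10, ("high", "spam"): 40})
serve_concept = build({("low", "ham"): 72, ("low", "spam"): 8, ("high", "ham"): 16, ("high", "spam"): 4})
serve_both = build({("low", "ham"): 45, ("low", "spam"): 5, ("high", "ham"): 40, ("high", "spam"): 10})

if __name__ == "__main__":
    for name, serve in [("cov", serve_cov), ("concept", serve_concept), ("both", serve_both)]:
        print(name, "->", describe_shift(train, serve))
```

In `serve_cov` only the share of high-rate senders changes, in `serve_concept` only the spam rate among high-rate senders changes, and in `serve_both` both change at once.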

In the message sending example, the distribution of the features X (the words in the message) and the conditional distribution of the target variable Y (whether the message is spam or ham) both change over time. This is because the way people use language changes over time, and the way spammers and scammers try to trick people also changes over time. As a result, a model that was trained to classify spam and ham messages a few years ago may not be as accurate today.

I do not have direct access to the course but it may have been introduced as data drift because data drift is a more general term that includes both covariate shift and concept shift.

I hope I was able to answer your question. Please do not hesitate to post a followup if you feel confused.

Best,
Can

Feihong_YANG · May 1, 2023, 3:19pm

Hi @canxkoz, here is the context:

"I do not have direct access to the course but it may have been introduced as data drift because data drift is a more general term that includes both covariate shift and concept shift."

May I understand dataset shift as also being one case of data drift? In that case the mentor’s definition is correct, since both the distribution of messages sent per minute and the decision boundary that separates spam from ham changed.

And since you said “If the distribution of the features and the target variable both change, then this is called dataset shift” in an earlier comment, it belongs to neither concept shift nor covariate shift. So we can say data drift includes 3 scenarios: dataset shift, covariate shift and concept shift. And here concept shift and concept drift are the same issue.

Let me know if there is any mistake in my summary.

Thanks,
Feihong

canxkoz · May 1, 2023, 4:15pm

I think your summary is correct.

@Isaak_Kamau could you also confirm that the learner’s summary is correct?

Thanks,
Can

Isaak_Kamau · May 2, 2023, 6:29am

Thanks, @canxkoz for helping out!
@Feihong_YANG Your summary seems to be correct. I also think regarding the message-sending example, if the decision boundary to classify spam or ham changed due to changes in both the distribution of features (e.g., the content of the messages) and the relationship between the features and the target variable (e.g., the definition of spam), dataset shift would be a more inclusive term.
