Dear @Feihong_YANG ,
Thank you for asking this question. I will do my best to answer your question. If you have further questions please do not hesitate to ask.
Below are the definitions of some of the terms that you may feel confused about.
- Dataset shift refers to any difference between the joint distribution of the features and the target at training time, Ptr(y, x), and at serving time, Pserv(y, x). This can happen for a variety of reasons, such as changes in the population that the data comes from, or changes in the way that data is collected.
- Covariate shift refers to a change in the distribution of the features, P(x), while the relationship between the features and the target variable, P(y|x), stays the same. This can happen, for example, if the features are collected in different ways over time.
- Concept shift refers to a change in the relationship between the features and the target variable, P(y|x). This can happen, for example, if the meaning of the target variable changes over time, or if the same feature values start to lead to different outcomes.
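In case the notation helps, the three definitions can be written side by side. This is only a restatement using the Ptr / Pserv notation from your question, based on the factorization of the joint distribution into P(y|x) and P(x):

```latex
\begin{align*}
P(y, x) &= P(y \mid x)\,P(x) \\
\text{dataset shift:}\quad & P_{\mathrm{tr}}(y, x) \neq P_{\mathrm{serv}}(y, x) \\
\text{covariate shift:}\quad & P_{\mathrm{tr}}(x) \neq P_{\mathrm{serv}}(x),\quad P_{\mathrm{tr}}(y \mid x) = P_{\mathrm{serv}}(y \mid x) \\
\text{concept shift:}\quad & P_{\mathrm{tr}}(y \mid x) \neq P_{\mathrm{serv}}(y \mid x)
\end{align*}
```

Because the joint distribution is the product of the two factors, a change in either factor is a dataset shift.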
Now that we are on the same page about the definitions, I would like to reply to your question about “Ptr(y,x) != Pserv(y,X) and Covariate refer to Ptr(y|x) = Pserv(y|x) but Ptr(X) != Pserv(x) relating to covartiate shift.”
Yes, the issue you found could be related to covariate shift. If the distribution of the features in the training data differs from the distribution of the features in the serving (test) data, the model may not generalize well. This is because the model was fit to the regions of feature space that were common during training, so its predictions can be less reliable in regions that were rare or absent in training but are common at serving time, even though P(y|x) itself has not changed.
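To make this concrete, here is a minimal sketch of covariate shift on made-up data. Everything in it is an assumption for illustration: the rule `true_relation`, the feature ranges, and the noise level are invented and are not taken from the course lab. The rule that generates y from x is identical for training and serving; only the range of x moves.

```python
import random


def true_relation(x):
    # The same rule generates y in training and serving, so P(y|x) is fixed.
    return x * x


def fit_line(xs, ys):
    # Ordinary least squares for a single feature: returns (slope, intercept).
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    return slope, mean_y - slope * mean_x


def mse(model, xs, ys):
    slope, intercept = model
    errors = [(slope * x + intercept - y) ** 2 for x, y in zip(xs, ys)]
    return sum(errors) / len(errors)


rng = random.Random(0)
# Hypothetical feature ranges: training sees x in [0, 1], serving sees x in [2, 3].
train_x = [rng.uniform(0.0, 1.0) for _ in range(500)]
serve_x = [rng.uniform(2.0, 3.0) for _ in range(500)]
train_y = [true_relation(x) + rng.gauss(0, 0.05) for x in train_x]
serve_y = [true_relation(x) + rng.gauss(0, 0.05) for x in serve_x]

model = fit_line(train_x, train_y)
train_mse = mse(model, train_x, train_y)
serve_mse = mse(model, serve_x, serve_y)
print(f"train MSE: {train_mse:.4f}  serving MSE: {serve_mse:.4f}")
```

The line fits well where the training data lives, but it has to extrapolate to the serving range, so the serving error is much larger even though the underlying relationship never changed.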
In your example, the change in the average number of messages sent per minute could be due to a change in the way that data is being collected. If the data is collected from a different population, or in a different way, then the distribution of the features will change. This is covariate shift, and it can hurt the performance of the model.
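One simple way to check a single feature for this kind of change is to compare its empirical distribution in the training data with its distribution in the serving data. The sketch below computes the two-sample Kolmogorov-Smirnov statistic (the largest gap between the two empirical CDFs) by hand. The messages-per-minute values are simulated, and the means of 20 and 35 are invented numbers, not values from your data.

```python
import bisect
import random


def ks_statistic(sample_a, sample_b):
    """Largest gap between the two empirical CDFs (two-sample KS statistic)."""
    a = sorted(sample_a)
    b = sorted(sample_b)
    largest_gap = 0.0
    # Both empirical CDFs only change at observed values, so checking those is enough.
    for value in a + b:
        cdf_a = bisect.bisect_right(a, value) / len(a)
        cdf_b = bisect.bisect_right(b, value) / len(b)
        largest_gap = max(largest_gap, abs(cdf_a - cdf_b))
    return largest_gap


rng = random.Random(42)
# Simulated "messages per minute" feature (hypothetical values).
train_rate = [rng.gauss(20, 5) for _ in range(1000)]
same_rate = [rng.gauss(20, 5) for _ in range(1000)]      # serving data, no shift
shifted_rate = [rng.gauss(35, 5) for _ in range(1000)]   # serving data, mean moved

gap_same = ks_statistic(train_rate, same_rate)
gap_shifted = ks_statistic(train_rate, shifted_rate)
print(f"no shift: {gap_same:.3f}  shifted: {gap_shifted:.3f}")
```

A statistic near 0 means the two samples look alike, and a value near 1 means they barely overlap. In practice a library routine such as `scipy.stats.ks_2samp` is more convenient because it also returns a p-value.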
Note that covariate shift is one specific kind of dataset shift. Dataset shift is the umbrella term for any case where Ptr(y, x) != Pserv(y, x), so it also covers concept shift, where P(y|x) changes. Concept shift is usually the more severe case, because the relationship that the model learned between the features and the target variable is no longer the right one.
About your second question, "is that because there is a change on P(x) already so we don’t classify it a concept change?"
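Yes, as long as P(y|x) stays the same. The classification depends on which part of the joint distribution changed: if only P(x) changes, it is covariate shift; if P(y|x) changes, it is concept shift, whether or not P(x) has changed as well. In practice the two can occur together, so a change in P(x) does not by itself rule out concept shift.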
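If labels are available for the serving data, one rough way to tell the two cases apart is to group x into bins and compare the average y per bin between training and serving. The sketch below does this on simulated data; the bin edges, the slopes 2 and 3, and the helper names `mean_y_per_bin` and `largest_bin_gap` are all hypothetical choices made for illustration.

```python
import random


def mean_y_per_bin(xs, ys, edges):
    # Average y within each [edges[i], edges[i + 1]) bin; None for empty bins.
    sums = [0.0] * (len(edges) - 1)
    counts = [0] * (len(edges) - 1)
    for x, y in zip(xs, ys):
        for i in range(len(edges) - 1):
            if edges[i] <= x < edges[i + 1]:
                sums[i] += y
                counts[i] += 1
                break
    return [s / c if c else None for s, c in zip(sums, counts)]


def largest_bin_gap(means_a, means_b):
    # Compare only the bins that contain data on both sides.
    gaps = [abs(a - b) for a, b in zip(means_a, means_b)
            if a is not None and b is not None]
    return max(gaps)


rng = random.Random(1)
edges = [0, 2, 4, 6, 8, 10]


def sample(low, high, slope, n=2000):
    xs = [rng.uniform(low, high) for _ in range(n)]
    ys = [slope * x + rng.gauss(0, 1) for x in xs]
    return xs, ys


train = sample(0, 10, slope=2)            # training data
covariate_only = sample(4, 10, slope=2)   # P(x) moved, same rule for y
concept = sample(0, 10, slope=3)          # same P(x), different rule for y

train_means = mean_y_per_bin(*train, edges)
covariate_gap = largest_bin_gap(train_means, mean_y_per_bin(*covariate_only, edges))
concept_gap = largest_bin_gap(train_means, mean_y_per_bin(*concept, edges))
print(f"covariate-only gap: {covariate_gap:.2f}  concept gap: {concept_gap:.2f}")
```

When only P(x) moves, the per-bin averages stay close to the training values; when the rule for y changes, they drift apart. This is only an approximate check, since the distribution of x inside a bin can shift too.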
In your example, the average number of messages sent per minute increased with the day. This could be due to a number of factors, such as people being more active on social media during certain times of the day. If the model is not updated to reflect this change, its predictions of the number of messages sent per minute may become less accurate.
Concept shift is a more challenging problem to deal with than covariate shift. This is because the relationship that the model learned becomes outdated, so the model has to keep relearning the relationship between the features and the target variable on an ongoing basis, which typically requires fresh labeled data. There are a number of techniques that can be used to deal with concept shift, such as online learning and ensemble methods.
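As an illustration of the online learning idea, here is a minimal sketch of a linear model that is updated with one gradient step per labeled example, so it can follow a relationship that changes mid-stream. The class `OnlineLinearModel`, the learning rate, and the slopes 2 and 5 are hypothetical choices for this example, not a recommended production setup.

```python
import random


class OnlineLinearModel:
    """One-feature linear model trained by stochastic gradient descent."""

    def __init__(self, learning_rate=0.1):
        self.weight = 0.0
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict(self, x):
        return self.weight * x + self.bias

    def update(self, x, y):
        # One gradient step on the squared error of a single example.
        error = self.predict(x) - y
        self.weight -= self.learning_rate * error * x
        self.bias -= self.learning_rate * error


rng = random.Random(7)
model = OnlineLinearModel()

# Phase 1: y is roughly 2 * x.
for _ in range(3000):
    x = rng.uniform(0.0, 1.0)
    model.update(x, 2.0 * x + rng.gauss(0, 0.05))
slope_before_shift = model.weight

# Phase 2: concept shift, y is now roughly 5 * x.
for _ in range(3000):
    x = rng.uniform(0.0, 1.0)
    model.update(x, 5.0 * x + rng.gauss(0, 0.05))
slope_after_shift = model.weight

print(f"slope before shift: {slope_before_shift:.2f}  after: {slope_after_shift:.2f}")
```

A model trained once on the first phase would keep predicting with a slope near 2, while the online model moves toward the new slope as labeled examples arrive. The trade-off is that it depends on getting labels reasonably quickly after serving.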
I hope this helps to clarify the difference between concept shift and covariate shift.
Please feel free to ask a follow-up question if you have any.
Regards,
Can