What are the causes of feature skew?

The instructor said that feature skew “could happen as the system uses different data sources
during training and serving” and he also mentioned “things like seasonality and trend as well.”
In the practice quiz, question 3 asks about when distribution skew occurs.
Can anyone please check that question and the video to understand my point? Is that question wrong? There seems to be a contradiction between what the instructor said in the video and question 3 in the quiz.

There are 3 types of skews:

  1. Dataset shift
  2. Covariate shift
  3. Concept shift

There is still one more right answer in the quiz, based on dataset shift. Imagine a Naive Bayes classifier that estimates the probability of spam given the message sender. Say you trained your model two days ago and a person became a spammer yesterday.

In the serving environment, the number of mails sent by this person is far higher than in the training data. To put it another way, you have more samples from this new spammer in the serving environment than in the training environment. Please use this hint to find the other answer.
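To make the hint concrete, here is a minimal sketch of the situation. The sender names and message counts are made up for illustration; the point is only that one sender's share of the traffic is much larger at serving time than it was at training time.

```python
from collections import Counter


def sender_frequencies(senders):
    """Return each sender's share of the messages in `senders`."""
    counts = Counter(senders)
    total = sum(counts.values())
    return {sender: n / total for sender, n in counts.items()}


# Hypothetical data: "carol" sends few mails at training time
# and many mails at serving time (she became a spammer in between).
training_senders = ["alice"] * 40 + ["bob"] * 40 + ["carol"] * 20
serving_senders = ["alice"] * 20 + ["bob"] * 20 + ["carol"] * 60

train_freq = sender_frequencies(training_senders)
serve_freq = sender_frequencies(serving_senders)

# Compare the two distributions sender by sender.
for sender in sorted(set(train_freq) | set(serve_freq)):
    t = train_freq.get(sender, 0.0)
    s = serve_freq.get(sender, 0.0)
    print(f"{sender}: training={t:.2f} serving={s:.2f} change={s - t:+.2f}")
```

A model trained on the first distribution has seen very little evidence about "carol", so its estimate for her is based on data that no longer describes what it sees in serving.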

Do remove quiz answers from your post.

I’m saying that there is a contradiction between the quiz and the video.
In the video, the instructor said that feature skew “could happen as the system uses different data sources during training and serving” and he also mentioned “things like seasonality and trend as well.” In question 3 of the quiz, however, “Different data sources for training and serving data” and “Trend, seasonality, changes in data over time.” are right answers to the question.

Hi Bassel,

I am trying to understand which part you find confusing. Could it be that, according to your quote from the instructor (video), the explanations for skew are:

  1. “could happen as the system uses different data sources during training and serving”
  2. “things like seasonality and trend as well.”

So the video mentions 3 factors: different data sources, seasonality, and trend. Do you agree on this?

For question 3 of the quiz, you quoted these answers to the same question (explanations for skew):

  1. “Different data sources for training and serving data”
  2. “Trend, seasonality, changes in data over time.”

Am I missing something here? I do not see the contradiction, they are the same factors.

Happy learning

Rosa

I agree with you on the video part. But question 3 in the quiz asks me about “distribution skew”, not “feature skew”. The correct answers are things that the instructor talked about in the feature skew part:
1- Different data sources for training and serving data.
2- Trend, seasonality, changes in data over time.
Both are correct answers for question 3, which asks about distribution skew and not feature skew.
These are 2 different concepts: feature skew and distribution skew.

Hi Bassel,

Sorry for the late reply, I have been quite busy. OK, I see your point. The thing is that both concepts are related; in fact, feature skew can be one of the causes of distribution skew, as explained here:

Distribution Skews
The training and serving data sources have different distributions, even though they should conceptually be the same data set. It can be caused by:

  1. The difference between the training data pipeline and the serving application code (feature skew).
  2. Faulty sampling in training that is not representative of serving.
  3. Data changes (distribution skew) between the time of training and the time of serving, because no matter how tightly you run the training, some time has passed by the time the system gets to serving.
  4. Trend, seasonality.
  5. A feedback loop between the model and the algorithm. Your model's predictions influence the real world, which affects the future data distribution in serving (e.g. stock price prediction and portfolio management).

That is why whatever creates feature skew will end up creating a distribution skew as well. So the quiz answers are correct, but I agree with you that the lecture should perhaps mention this connection between the two concepts.
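Here is a small sketch of how a feature skew can show up as a distribution skew. Everything in it is an assumption for illustration: the two transform functions stand in for a training pipeline and a serving code path that disagree (the serving one forgets to lowercase), and the L-infinity distance is just one of several ways to compare two categorical distributions.

```python
from collections import Counter


def distribution(values):
    """Return the relative frequency of each distinct value."""
    counts = Counter(values)
    total = sum(counts.values())
    return {value: n / total for value, n in counts.items()}


def l_infinity_distance(p, q):
    """Largest absolute difference in probability over all categories."""
    keys = set(p) | set(q)
    return max(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


def training_transform(domain):
    # Hypothetical training pipeline: trims and lowercases the sender domain.
    return domain.strip().lower()


def serving_transform(domain):
    # Hypothetical serving code: trims but forgets to lowercase (feature skew).
    return domain.strip()


# The same raw data goes through both code paths.
raw_domains = ["Example.com", "example.com", "Mail.org", "mail.org"]

train_dist = distribution(training_transform(d) for d in raw_domains)
serve_dist = distribution(serving_transform(d) for d in raw_domains)

skew = l_infinity_distance(train_dist, serve_dist)
print(f"L-infinity distance between training and serving: {skew:.2f}")
```

The raw data is identical in both cases, yet the feature distributions differ because the two code paths transform it differently. That is the sense in which feature skew is one cause of distribution skew.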

Thanks for the feedback.

Best,

Rosa