According to the section ‘Validating Data’ of week 1, feature skew is defined as ‘Training feature values are different than the serving feature values’.
I am confused by this description. Isn't it normal for feature values to differ between data instances? For example, if body weight is a feature of the dataset, its value varies from one instance to the next in both the training set and the serving set.
I would have thought that a difference in statistical distribution should be classified as distribution skew rather than feature skew. Could you clarify this further?
We want the distribution of each feature to be similar across the training and serving datasets. If we assume the training data is normally distributed, we can treat serving points that fall outside mean +/- 3 standard deviations of the training distribution for that feature as outliers (assuming your problem allows for such a consideration).
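As a minimal sketch of that 3-standard-deviation check (the feature values below are made up for illustration):

```python
import statistics

# Hypothetical training and serving values for a "body weight" feature.
train_weights = [61.2, 70.5, 68.1, 75.0, 66.3, 72.8, 69.4, 64.9]
serving_weights = [67.0, 71.3, 140.2]  # 140.2 is an obvious outlier

# Learn the bounds from the TRAINING distribution only.
mean = statistics.mean(train_weights)
std = statistics.stdev(train_weights)
lower, upper = mean - 3 * std, mean + 3 * std

# Flag serving values outside mean +/- 3 standard deviations.
outliers = [x for x in serving_weights if not (lower <= x <= upper)]
print(outliers)
```

Note that the bounds come from the training data alone; the serving data is only checked against them.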
On the other hand, if a feature's values in the training and serving datasets are unrelated, we have a problem. This is referred to as feature skew. Common causes of this scenario are:
Feature values are actually different: consider a clothing shop in the USA that uses the number of customers per day as a feature in a model that predicts daily profit. If the model was trained on data from March but deployed around Thanksgiving, the training and serving values can be completely different.
Transformation parameters differ: transformation parameters should be learnt from the training data, and those learnt parameters should be used to transform the serving dataset with the same methodology. If this isn't the case, there will likely be a difference in feature values across the two datasets.
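The transformation rule above can be sketched with a simple standardization example (the fit/transform split here is a hand-rolled illustration, not any particular library's API):

```python
import statistics

def fit_standardizer(train_values):
    """Learn normalization parameters from the TRAINING data only."""
    return statistics.mean(train_values), statistics.stdev(train_values)

def transform(values, mean, std):
    """Apply previously learnt parameters to any dataset."""
    return [(x - mean) / std for x in values]

train = [10.0, 12.0, 11.0, 13.0, 14.0]
serving = [12.5, 9.0]

mean, std = fit_standardizer(train)             # learnt from training data
train_scaled = transform(train, mean, std)       # used to train the model
serving_scaled = transform(serving, mean, std)   # SAME parameters at serving time
```

Refitting the standardizer on the serving data instead of reusing `mean` and `std` would shift the scaled values, producing exactly the kind of feature skew described above.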
In feature skew, we are concerned with individual features compared across the training and serving datasets. In distribution skew, we study how the input and output features interact across the two datasets.
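One hedged sketch of how an individual feature's distribution can be compared across the two datasets: bucket both into histograms and take the L-infinity distance between the normalized counts (tools such as TFDV use comparable distance metrics; the bucket size and the 0.1 threshold below are arbitrary choices for illustration).

```python
from collections import Counter

def normalized_histogram(values, bucket_size):
    """Bucket values and return each bucket's fraction of the total."""
    buckets = Counter(int(v // bucket_size) for v in values)
    total = len(values)
    return {b: c / total for b, c in buckets.items()}

def linf_distance(hist_a, hist_b):
    """Largest per-bucket difference between two normalized histograms."""
    keys = set(hist_a) | set(hist_b)
    return max(abs(hist_a.get(k, 0.0) - hist_b.get(k, 0.0)) for k in keys)

train = [1.0, 1.2, 1.1, 0.9, 1.3, 1.0]
serving = [3.0, 3.1, 2.9, 3.2]  # clearly shifted relative to training

dist = linf_distance(
    normalized_histogram(train, bucket_size=1.0),
    normalized_histogram(serving, bucket_size=1.0),
)
if dist > 0.1:  # arbitrary illustrative threshold
    print("possible skew detected:", dist)
```

A large distance flags the feature for inspection; it does not by itself say whether the cause is a genuine change in the world or a transformation mismatch.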