What is the correct name for data and concept drift?

Should we call it data/concept skew or data/concept drift? I ask because the slides say that “skew” is the difference between training and serving data, while “drift” is the change in the data over time.
Also, the same slides mention concept/data “shift”. Is “shift” a different thing?

Hi, @Bassel !

Data shift usually refers to changes in the reality you are trying to model, which are then reflected in its data. It is crucial in, say, solar energy production forecasting, where the data clearly depends on time-dependent phenomena.
Skew, on the other hand, describes the differences in data distribution that appear when you split your data, for example into training and test sets. Unless you take extreme care when splitting, it is normally inherent to having separate data sets.

What about “drift”? Is data/concept drift the same as data/concept shift? I understand what is written in the image I posted, but I am still confused about some of the concepts. Should we call it data/concept skew or data/concept drift? I don’t often see “data/concept skew”; I only see data/concept drift.

Hi @Bassel ,

In Course 1, Dr Ng defines these concepts:

  • Concept/Data Drift
  • Skewed Data

Concept/Data Drift: This happens when the environment or data changes after the system has been trained and is being tested, or after the system has been deployed. The drift can happen right at deployment or slowly over time. Examples of changes in data:

  1. You train a system for defect detection with pictures that have a certain brightness; when you deploy the system, the pictures taken are darker because the lighting at the location has changed.
  2. You train a system for speech recognition on a dataset made up mainly of adult voices. Once the system is deployed, it starts to be used by younger people, and since their voices are different, the system may start failing on its predictions.
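
As a rough illustration of the brightness example, here is a minimal sketch of one way such a change could be flagged. It is not from the course: the brightness values are synthetic, `ks_statistic` and `has_drifted` are hypothetical helper names, and the `DRIFT_THRESHOLD` of 0.2 is an assumed cut-off you would need to tune. It compares the empirical distributions of a feature in the training data and in deployment using the two-sample Kolmogorov-Smirnov statistic.

```python
import random
from bisect import bisect_right


def ks_statistic(reference, current):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap
    between the empirical CDFs of the two samples."""
    ref = sorted(reference)
    cur = sorted(current)
    max_gap = 0.0
    # The largest gap between two step functions occurs at a data point.
    for x in ref + cur:
        cdf_ref = bisect_right(ref, x) / len(ref)
        cdf_cur = bisect_right(cur, x) / len(cur)
        max_gap = max(max_gap, abs(cdf_ref - cdf_cur))
    return max_gap


DRIFT_THRESHOLD = 0.2  # assumed cut-off for illustration only


def has_drifted(reference, current, threshold=DRIFT_THRESHOLD):
    """Flag drift when the two distributions differ by more than the threshold."""
    return ks_statistic(reference, current) > threshold


# Synthetic mean-brightness values per image (0-255 scale), made up for this sketch.
rng = random.Random(0)
train_brightness = [rng.gauss(150, 10) for _ in range(500)]
same_lighting = [rng.gauss(150, 10) for _ in range(500)]
darker_lighting = [rng.gauss(110, 10) for _ in range(500)]

print("same lighting drifted:  ", has_drifted(train_brightness, same_lighting))
print("darker lighting drifted:", has_drifted(train_brightness, darker_lighting))
```

In practice you would monitor several input statistics, not just one, and pick the threshold from historical data.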

Skewed data: This happens when the ratio between positive and negative samples is very far from 50/50. It is a property of the dataset you are going to train your model on. Examples:

  1. You will train a model to detect defects in a production line. Your training dataset is 99.7% photos of good products and just 0.3% photos of defective products.
  2. You will train a model to detect illness in X-rays. Your training dataset is 99.5% healthy X-rays and just 0.5% X-rays showing some illness.
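
To see why this kind of skew matters, here is a small sketch using the 99.7% / 0.3% split from the first example. The labels are made up, and `positive_ratio`, `accuracy`, and `recall` are hypothetical helpers written for this post. A "model" that always predicts "good" gets very high accuracy while catching no defects at all.

```python
def positive_ratio(labels):
    """Fraction of samples labelled positive (1 = defective)."""
    return sum(labels) / len(labels)


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def recall(y_true, y_pred):
    """Fraction of true positives that the model actually found."""
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    actual_pos = sum(y_true)
    return true_pos / actual_pos if actual_pos else 0.0


# Made-up dataset: 3 defective photos out of 1000 (0.3% positive).
labels = [1] * 3 + [0] * 997

# A trivial baseline that always predicts "good product".
always_good = [0] * len(labels)

print("positive ratio:", positive_ratio(labels))
print("accuracy:      ", accuracy(labels, always_good))
print("recall:        ", recall(labels, always_good))
```

This is why metrics such as precision and recall are usually more informative than raw accuracy on skewed datasets.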

Hope this sheds some more light on this question.

Thanks,

Juan

I think that data/concept skew is different from skewed data. It’s the same word, “skew”, but it has a different meaning in each case. I agree with you that skewed data happens when the ratio between positive and negative samples is very far from 50/50, but data/concept skew is when there is a difference between the training data and the serving data. You can see it in the slide that I posted. Please tell me if I’m wrong.

You are right, @Bassel. The skew I am referring to belongs to the training phase of the model. The skew you are referring to, more precisely the training-serving skew, happens in production. I have found a post that discusses your specific question, the difference between training-serving skew and drift. Check it out: What is training-serving skew in machine learning? | Qwak's Blog
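
Going by the definitions quoted from the slides (skew is training data vs. serving data, drift is change over time), the two checks differ mainly in which datasets you compare. Here is a rough sketch with made-up feature values; `standardized_mean_gap` is a hypothetical helper and `GAP_THRESHOLD` is an assumed cut-off, not something taken from the course or the linked post.

```python
from statistics import mean, stdev


def standardized_mean_gap(baseline, other):
    """Absolute difference in means, in units of the baseline's standard deviation."""
    return abs(mean(other) - mean(baseline)) / stdev(baseline)


GAP_THRESHOLD = 1.0  # assumed cut-off for illustration only

# Made-up values of one numeric feature.
training = [10.0, 11.0, 9.0, 10.5, 9.5, 10.0]
serving_week_1 = [10.2, 9.8, 10.1, 10.4, 9.6, 9.9]
serving_week_8 = [13.0, 12.5, 13.5, 12.8, 13.2, 13.0]

# Training-serving skew: training data vs. the data seen when serving.
skew_gap = standardized_mean_gap(training, serving_week_1)

# Drift: serving data now vs. serving data from an earlier period.
drift_gap = standardized_mean_gap(serving_week_1, serving_week_8)

print("training-serving skew flagged:", skew_gap > GAP_THRESHOLD)
print("drift over time flagged:      ", drift_gap > GAP_THRESHOLD)
```

In this toy data the first serving week matches training, so there is no skew, but the feature has moved by week 8, which is drift.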

Hopefully this brings light to your question. It did shed more light on my own understanding of the matter.

Happy to share more on this.

Thanks,

Juan


Course 2 uses the terms skew, shift and drift interchangeably, which is quite misleading and hard to follow. Could someone summarise all the data issues mentioned in Course 2 in a clear manner? Thank you.