Labelling: As the tutor mentioned, data will change over time, and we need proper labelling of the ground-truth data, with human effort from the domain side.
I’m trying to understand this clearly with an example. Let’s say we are doing house price prediction, with both continuous and categorical features. We have trained the model and deployed it into production. After a few months, the distribution of the new incoming data has changed. So my questions are:
- Do we pause the pipeline once we identify a change in the distribution?
- Does this data require human labelling by domain experts?
- Once this data has been labelled, do we need to retrain the model by combining the existing training data with this newly labelled data?
Could you please give some clear examples with structured and unstructured data, if possible?
Hi,
these are important questions.
First of all, we must remember one of the most important concepts Andrew Ng explained in Course 1: we need to take a “data-centric approach”. ML models learn from data, and it is of the utmost importance to spend as much effort as possible on having good-quality data.
Data can change over time.
First, we need to monitor: another important concept (remember the framework presented).
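As a minimal sketch of what monitoring a feature distribution could look like (the `drift_score` function and the alert threshold are hypothetical, not part of the course framework):

```python
import numpy as np

def drift_score(reference, production):
    """Per-feature drift check: how many reference standard
    deviations has the production mean shifted by?"""
    ref_mean = reference.mean(axis=0)
    ref_std = reference.std(axis=0) + 1e-9  # avoid division by zero
    prod_mean = production.mean(axis=0)
    return np.abs(prod_mean - ref_mean) / ref_std

rng = np.random.default_rng(0)
# Reference window: 3 features drawn from the training-time distribution.
ref = rng.normal(loc=0.0, scale=1.0, size=(1000, 3))
# Production window: feature 1 has drifted (its mean moved to 2.0).
prod = rng.normal(loc=[0.0, 2.0, 0.0], scale=1.0, size=(1000, 3))

scores = drift_score(ref, prod)
flagged = scores > 0.5  # hypothetical alert threshold
```

In practice you would run a check like this on a schedule over recent production batches, and only then decide whether the shift is an error or a genuine change.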
Then, we need to take action.
If the data distribution has changed, we need to figure out why. Is it the effect of errors (human error, a bug in the preprocessing code), or is it a change in the data itself?
For example, say we do normalization on the input data. If we discover that the mean of one feature has changed, and it is not the effect of a mistake, we need to change the normalization part of the pipeline.
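A small sketch of that normalization update (the `Standardizer` class and the numbers are illustrative assumptions, not from the course):

```python
import numpy as np

class Standardizer:
    """Standard-scaling step of a preprocessing pipeline."""
    def fit(self, X):
        self.mean_ = X.mean(axis=0)
        self.std_ = X.std(axis=0) + 1e-9  # avoid division by zero
        return self
    def transform(self, X):
        return (X - self.mean_) / self.std_

rng = np.random.default_rng(1)
old_data = rng.normal(100.0, 10.0, size=(500, 1))  # e.g. house size in m^2
new_data = rng.normal(150.0, 10.0, size=(500, 1))  # the mean has shifted

scaler = Standardizer().fit(old_data)
# New data standardized with the old statistics is badly off-center...
off_center = scaler.transform(new_data).mean()
# ...so we refit the normalization step on the recent data.
scaler.fit(new_data)
centered = scaler.transform(new_data).mean()
```

After the refit, the model sees inputs on the scale it expects again; whether the model itself also needs retraining is a separate decision.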
If the data have changed, we need to retrain and test the model, to see if we can return to the original performance and adapt to the change in the outside world. The house-price example is a good one here: if there has been a big change in the market and house prices have dropped, our current model will perform worse. We don’t need to change the algorithm, but we do need to gather good-quality data, retrain, and retest.
Labeling is only one part, where humans are often involved. Again, as Andrew has explained, by working on the quality and consistency of the labels we can improve the model’s performance.
To make a long story short, there is no simple answer to your questions. But I would say you’ll find a lot of other answers during the course.
@luigisaetta thanks for your reply. So, as you said, we change the normalization pipeline?
Could you please clearly tell me one thing?
- Firstly, the new incoming data has to be labelled using human domain knowledge, am I right?
- Once the new data has been labelled, do we need to combine the existing training data with this newly labelled data and retrain the model?
Could you please kindly provide an answer to this?
Hi,
There are no easy answers to these questions. It all depends on the context.
Just one example: while monitoring your model, you see that its performance is getting worse.
Two possible situations:
- Data drift: a feature’s mean has changed. You need to preprocess (standardize) with the new mean, so the pipeline needs this change. But after standardization, you can combine new and old data (all standardized) for retraining.
- Concept drift: the X → y mapping has changed. For example, a 3-room house now has a lower price. You change the way you label the data, and you cannot use the old data (linked to a different X → y mapping) for training.
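The two retraining strategies above can be sketched as follows (a hypothetical helper, just to make the old-data decision concrete):

```python
import numpy as np

def build_training_set(old_X, old_y, new_X, new_y, drift_type):
    """Assemble retraining data depending on the kind of drift."""
    if drift_type == "data":
        # Data drift: the X -> y mapping still holds, so keep the
        # old examples and standardize everything together.
        X = np.vstack([old_X, new_X])
        y = np.concatenate([old_y, new_y])
    elif drift_type == "concept":
        # Concept drift: old labels reflect an outdated mapping,
        # so retrain on the newly labelled data only.
        X, y = np.asarray(new_X), np.asarray(new_y)
    else:
        raise ValueError(f"unknown drift type: {drift_type}")
    mean, std = X.mean(axis=0), X.std(axis=0) + 1e-9
    return (X - mean) / std, y

# Toy example: 3 old houses, 2 newly labelled ones.
old_X = np.array([[1.0], [2.0], [3.0]])
old_y = np.array([100.0, 200.0, 300.0])
new_X = np.array([[4.0], [5.0]])
new_y = np.array([80.0, 90.0])

X_data, y_data = build_training_set(old_X, old_y, new_X, new_y, "data")
X_conc, y_conc = build_training_set(old_X, old_y, new_X, new_y, "concept")
```

Under data drift the training set keeps all five examples; under concept drift only the two newly labelled ones survive.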
One final thing: every step, every change, must be tested to see whether it works. I have seen situations where what seemed a good idea didn’t work. Production data are complex: you normally have many features, and you cannot be sure you understand everything.