Labelling: As the tutor mentioned, data will change over time, and we need proper labelling of the ground-truth data, with human effort from the domain side.
I’m trying to understand this clearly with an example. Let’s say we are doing house price prediction, with both continuous and categorical features. We have trained the model and deployed it into production. After a few months, the distribution of the new incoming data has changed. So my questions are:
- Do we pause the pipeline once we identify a change in the distribution?
- Does this data require human labelling by domain experts?
- Once this data has been labelled, do we need to retrain the model by combining the existing training data with this newly labelled data?
Could you please give some clear examples with structured and unstructured data, if possible?
Hi,
these are important questions.
First of all, we must remember one of the most important concepts Andrew Ng explained in Course 1: we need to take a “data-centric approach”. ML models learn from data, and it is of the utmost importance to spend as much effort as possible on having good-quality data.
Data can change over time.
First, we need to monitor: another important concept (remember the framework presented).
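As a minimal sketch of what monitoring a feature distribution could look like (the `drift_score` function and the alert threshold are hypothetical, not part of the course framework):

```python
import numpy as np

def drift_score(reference, production):
    """Per-feature drift check: how many reference standard
    deviations has the production mean shifted by?"""
    ref_mean = reference.mean(axis=0)
    ref_std = reference.std(axis=0) + 1e-9  # avoid division by zero
    prod_mean = production.mean(axis=0)
    return np.abs(prod_mean - ref_mean) / ref_std

rng = np.random.default_rng(0)
# Reference window: 3 features drawn from the training-time distribution.
ref = rng.normal(loc=0.0, scale=1.0, size=(1000, 3))
# Production window: feature 1 has drifted (its mean moved to 2.0).
prod = rng.normal(loc=[0.0, 2.0, 0.0], scale=1.0, size=(1000, 3))

scores = drift_score(ref, prod)
flagged = scores > 0.5  # hypothetical alert threshold
```

In practice you would run a check like this on a schedule over recent production batches, and only then decide whether the shift is an error or a genuine change.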
Then, we need to take action.
If the data distribution has changed, we need to figure out why. Is it the effect of errors (human error, a bug in the preprocessing code), or is it a change in the data itself?
For example, say we do normalization on the input data. If we discover that the mean of one feature has changed, and it is not the effect of a mistake, we need to change the normalization part of the pipeline.
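A small sketch of that normalization update (the `Standardizer` class and the numbers are illustrative assumptions, not from the course):

```python
import numpy as np

class Standardizer:
    """Standard-scaling step of a preprocessing pipeline."""
    def fit(self, X):
        self.mean_ = X.mean(axis=0)
        self.std_ = X.std(axis=0) + 1e-9  # avoid division by zero
        return self
    def transform(self, X):
        return (X - self.mean_) / self.std_

rng = np.random.default_rng(1)
old_data = rng.normal(100.0, 10.0, size=(500, 1))  # e.g. house size in m^2
new_data = rng.normal(150.0, 10.0, size=(500, 1))  # the mean has shifted

scaler = Standardizer().fit(old_data)
# New data standardized with the old statistics is badly off-center...
off_center = scaler.transform(new_data).mean()
# ...so we refit the normalization step on the recent data.
scaler.fit(new_data)
centered = scaler.transform(new_data).mean()
```

After the refit, the model sees inputs on the scale it expects again; whether the model itself also needs retraining is a separate decision.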
If the data have changed, we need to retrain and test the model, to see if we can return to the original performance and adapt to the change in the outside world. The house-price example is a good one here: if there has been a big change in the market and house prices have dropped, our current model will perform worse. We don’t need to change the algorithm, but we do need to gather good-quality data, retrain, and retest.
Labeling is only one part, where humans are often involved. Again, as Andrew has explained, by working on the quality and consistency of the labels we can improve the model’s performance.
To make a long story short, there is no simple answer to your questions. But I would say you’ll find a lot of other answers during the course.
@luigisaetta thanks for your reply. So, as you said, we change the normalization pipeline?
Could you please clearly tell me one thing?
- Firstly, the new incoming data has to be labelled using human domain knowledge, am I right?
- Once the new data has been labelled, do we need to combine the existing training data with this newly labelled data and retrain the model?
Could you please kindly provide an answer to this?
Hi,
There are no easy answers to these questions. It all depends on the context.
Just one example: while monitoring your model, you see that its performance is getting worse.
Two possible situations:
- Data drift: a feature’s mean has changed. You need to preprocess (standardize) with the new mean, so the pipeline needs this change. But after standardization, you can combine new and old data (all standardized) for retraining.
- Concept drift: the X → y mapping has changed. For example, a 3-room house now has a lower price. You change the way you label the data, and you cannot use the old data (linked to a different X → y mapping) for training.
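The two retraining strategies above can be sketched as follows (a hypothetical helper, just to make the old-data decision concrete):

```python
import numpy as np

def build_training_set(old_X, old_y, new_X, new_y, drift_type):
    """Assemble retraining data depending on the kind of drift."""
    if drift_type == "data":
        # Data drift: the X -> y mapping still holds, so keep the
        # old examples and standardize everything together.
        X = np.vstack([old_X, new_X])
        y = np.concatenate([old_y, new_y])
    elif drift_type == "concept":
        # Concept drift: old labels reflect an outdated mapping,
        # so retrain on the newly labelled data only.
        X, y = np.asarray(new_X), np.asarray(new_y)
    else:
        raise ValueError(f"unknown drift type: {drift_type}")
    mean, std = X.mean(axis=0), X.std(axis=0) + 1e-9
    return (X - mean) / std, y

# Toy example: 3 old houses, 2 newly labelled ones.
old_X = np.array([[1.0], [2.0], [3.0]])
old_y = np.array([100.0, 200.0, 300.0])
new_X = np.array([[4.0], [5.0]])
new_y = np.array([80.0, 90.0])

X_data, y_data = build_training_set(old_X, old_y, new_X, new_y, "data")
X_conc, y_conc = build_training_set(old_X, old_y, new_X, new_y, "concept")
```

Under data drift the training set keeps all five examples; under concept drift only the two newly labelled ones survive.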
One final thing: every step, every change, must be tested to see whether it works. I have seen situations where what seemed a good idea didn’t work. Production data are complex: you normally have many features, and you cannot be sure you understand everything.