C1_W2_Lab03_Feature_Scaling_and_Learning_Rate_Soln - normalizing the testing data


In this lab you say that

When normalizing the features, it is important to store the values used for normalization - the mean value and the standard deviation used for the computations. After learning the parameters from the model, we often want to predict the prices of houses we have not seen before. Given a new x value (living room area and number of bedrooms), we must first normalize x using the mean and standard deviation that we had previously computed from the training set.

and further down in the notebook the following operation is performed:
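(The snippet itself isn't reproduced here, but the step in question looks roughly like the following sketch; the variable names and data values are my own assumptions, not the lab's actual code.)

```python
import numpy as np

# Training features: living room area (sqft) and number of bedrooms
# (illustrative values, not the lab's dataset)
X_train = np.array([[2104, 3], [1416, 2], [1534, 3], [852, 2]], dtype=float)

# Compute and STORE the normalization constants from the training set only
mu = X_train.mean(axis=0)
sigma = X_train.std(axis=0)
X_train_norm = (X_train - mu) / sigma

# A new, unseen example is normalized with the SAME stored mu and sigma
x_new = np.array([1200, 2], dtype=float)
x_new_norm = (x_new - mu) / sigma
```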

I understand why the new data must be normalized; what’s causing me some doubts is the way in which it was normalized. Each observation in the training data was normalized using the means and standard deviations calculated from the training data, but the new observation (i.e., the testing data) was not included in the calculation of those means and standard deviations. What I mean to say is that the test data was not normalized in exactly the same way as the training data. This becomes even more apparent if the testing data includes several new observations. Is there a reason why this was done in this manner? Shouldn’t the means and standard deviations be recomputed to include the new data?

Hello @Andromeda18,

In short, no, they shouldn’t be recomputed. Your model is trained on data normalized in a particular way; if you apply a different normalization at prediction time, your model won’t work anymore.

@Andromeda18, we may also consider the following scenario:

  1. We can’t include test data when calculating the normalization constants for the training set, because test data is supposed to be unseen. We pretend we don’t have any test data while we are training.

  2. For the sake of simplifying the discussion, let’s say we have only one feature, the mean is 0, and the standard deviation is 3. If my sample feature value is 6, then after normalization, it becomes \frac{6-0}{3} = 2, and let’s assume the prediction for this sample is 2.5.

  3. Now, after training the model, we receive our test set, and again, the test set has a sample with feature value 6. Now, what would you expect? If you recalculate the normalization constants, the mean is going to be larger than 0, and the standard deviation is going to be different as well. In that case, the normalized value will become something other than 2, and putting that normalized feature value into the model, the prediction will no longer be 2.5.

  4. Here we are in a very embarrassing situation, because given the same original feature value of 6, just because of the difference between the original and the recalculated normalization constants, we end up with different predictions! We can’t let that happen, and so we shouldn’t recalculate the normalization constants.
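The scenario in steps 2–4 can be checked numerically. This is a sketch with a toy linear model (the weight, bias, and pooled data are assumptions chosen so that the normalized value 2 maps to the prediction 2.5 from the example above):

```python
import numpy as np

# Normalization constants stored from the training set (from the example)
mu, sigma = 0.0, 3.0

# Toy linear model: chosen so that z = 2 gives a prediction of 2.5
w, b = 1.25, 0.0
def predict(z):
    return w * z + b

x = 6.0  # same raw feature value seen at training time and at test time

# Correct: reuse the training constants -> z = (6 - 0) / 3 = 2.0
z_train = (x - mu) / sigma
pred_train = predict(z_train)          # 2.5

# Wrong: recompute the constants over pooled train + test data
pooled = np.array([-3.0, 0.0, 3.0, 6.0, 6.0])  # hypothetical pooled values
mu2, sigma2 = pooled.mean(), pooled.std()
z_test = (x - mu2) / sigma2            # no longer 2.0
pred_test = predict(z_test)            # no longer 2.5
```

The same raw input 6.0 produces two different predictions, which is exactly the inconsistency described above.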


Hi @rmwkwok,

Thank you for your reply. I wasn’t talking about including test data when calculating normalization constants for the training set, because I know nothing related to the test data should be used to train the model. I was talking about calculating the normalization constants for the whole data (train + test) after the model has been trained, prior to making a prediction. But I do understand your explanation of why it shouldn’t be done.

However, imagine there’s a considerable difference between the distributions of the training set and the test set. Better yet, let’s consider the model has been put into production (after being trained and tested) and the distribution of the new, unseen data that’s being used to make predictions is different from the distribution of the data used to train the model (data drift). If we normalize this new data with the constants calculated for the training set, the data might not be on the same scale as the training set. How should this be addressed?

This is right! In this case, your production system should send you a warning email about it, and you will need to check whether there is anything wrong with your data source. If nothing is wrong, then the systematic shift of that feature is real. If it is also true that your model performance has deteriorated badly, then you would need to consider collecting a new set of training data and retraining your model.
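One way such a warning could be triggered is to normalize each incoming batch with the stored training constants and flag it when the normalized statistics stray far from the expected mean 0 and standard deviation 1. This is only a minimal sketch; the function name, threshold, and data values are all assumptions:

```python
import numpy as np

def check_drift(x_new, mu, sigma, tol=0.5):
    """Return True if a batch, normalized with the stored training
    constants, looks systematically shifted or rescaled."""
    z = (np.asarray(x_new, dtype=float) - mu) / sigma
    mean_shift = abs(z.mean())          # expected near 0 without drift
    scale_shift = abs(z.std() - 1.0)    # expected near 0 without drift
    return bool(mean_shift > tol or scale_shift > tol)

mu, sigma = 1476.5, 444.6               # stored from the training set
ok_batch = [1000.0, 1500.0, 2000.0]     # spread similar to training data
shifted = [2800.0, 3000.0, 3100.0]      # systematically larger values

print(check_drift(ok_batch, mu, sigma))  # False
print(check_drift(shifted, mu, sigma))   # True
```

In practice you would use a proper statistical test (or a monitoring service) rather than a fixed threshold, but the idea is the same: the stored training constants define what "on the same scale as the training set" means.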


I see. Thank you very much for answering my questions. :slightly_smiling_face:

You are welcome @Andromeda18 :slight_smile: