Thanks for the wonderful course on Machine Learning.
I found the proposed technique for normalizing training data very useful:
x_norm = (x - mu) / sigma
where mu is the mean value of a feature and sigma is its standard deviation.
My question is: when working with test data, should it be normalized with the mu and sigma calculated on the training data, or should mu and sigma be recalculated on the test data? In both cases the values are expected to be close, but I see some differences when applying the two approaches to certain classification problems.
I would be grateful for an answer.
My best regards,
Hi @vasyl.delta, it is generally good practice to normalize the test data using the same mean and standard deviation (mu and sigma) that were used to normalize the training data. The test data should be representative of the distribution the model will see in production, and the model was trained on data normalized with those same values of mu and sigma.
If you recalculate the mean and standard deviation for the test data, you risk introducing a discrepancy between the distribution of the test data and the distribution of the data that the model was trained on. This can lead to a difference in the model’s performance on the test data compared to its performance on production data.
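A minimal sketch of this in NumPy (the variable names and toy data are illustrative, not from the course):

```python
import numpy as np

# Hypothetical feature matrices standing in for real train/test splits.
rng = np.random.default_rng(0)
X_train = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
X_test = rng.normal(loc=5.0, scale=2.0, size=(20, 3))

# Compute mu and sigma on the TRAINING data only.
mu = X_train.mean(axis=0)
sigma = X_train.std(axis=0)

# Apply the same training statistics to both splits.
X_train_norm = (X_train - mu) / sigma
X_test_norm = (X_test - mu) / sigma

# The training split is now exactly zero-mean, unit-variance;
# the test split is only approximately so, which is expected.
print(X_train_norm.mean(axis=0))
print(X_test_norm.mean(axis=0))
```

This mirrors what `sklearn.preprocessing.StandardScaler` does when you call `fit` on the training set and then `transform` on both sets.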
Whether to compute mu and sigma separately for the test set mostly depends on:

1. The domains of the training and test sets. If the two sets come from different domains, and the test set is closer to real-world (practical) data, then you may want to compute mu and sigma for the training set and separately compute mu and sigma for the test set, since they come from different distributions.
2. The sizes of the training and test sets. If the training set is large and covers most aspects of the data, including realistic examples, and the test set is small or close in distribution to the training set, then there is no need to compute separate statistics for the test set; it is enough to compute mu and sigma on the training set and apply them in all cases.
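As a quick sanity check for the kind of domain shift described above, one can compare per-feature statistics of the two splits. This is a hedged sketch, not a rigorous test; the data and the one-standard-deviation threshold are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical splits: the test set means are shifted relative to training.
X_train = rng.normal(loc=0.0, scale=1.0, size=(500, 2))
X_test = rng.normal(loc=1.5, scale=1.0, size=(100, 2))

mu_train = X_train.mean(axis=0)
sigma_train = X_train.std(axis=0)
mu_test = X_test.mean(axis=0)

# How many training standard deviations apart are the split means?
shift = np.abs(mu_test - mu_train) / sigma_train
print(shift)

# Arbitrary illustrative threshold: flag features whose means differ
# by more than one training standard deviation.
shifted_features = np.where(shift > 1.0)[0]
print(shifted_features)
```

If many features are flagged, the splits likely come from different domains, and a single set of training statistics may not describe the test data well.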