Thanks for the wonderful course on Machine Learning.
I found the proposed technique for normalizing training data very useful:
x_norm = (x - mu) / sigma
where mu is the mean value of a feature and sigma is its standard deviation.
My question is: when working with test data, should it be normalized with the mu and sigma calculated on the training data, or should mu and sigma be recalculated on the test data? In both cases the values are expected to be close, but I see some differences when applying the two approaches to certain classification problems.
I would be grateful for an answer.
My best regards,
Hi @vasyl.delta, it is generally good practice to normalize the test data using the same mean and standard deviation (mu and sigma) that were used to normalize the training data. The test data should be representative of the distribution the model will see in production, and the model was trained on data normalized with those same values of mu and sigma.
If you recalculate the mean and standard deviation for the test data, you risk introducing a discrepancy between the distribution of the test data and the distribution of the data that the model was trained on. This can lead to a difference in the model’s performance on the test data compared to its performance on production data.
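A minimal sketch of this in NumPy (the variable names and toy data are illustrative, not from the course):

```python
import numpy as np

# Hypothetical feature matrices standing in for real train/test splits.
rng = np.random.default_rng(0)
X_train = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
X_test = rng.normal(loc=5.0, scale=2.0, size=(20, 3))

# Compute mu and sigma on the TRAINING data only.
mu = X_train.mean(axis=0)
sigma = X_train.std(axis=0)

# Apply the same training statistics to both splits.
X_train_norm = (X_train - mu) / sigma
X_test_norm = (X_test - mu) / sigma

# The training split is now exactly zero-mean, unit-variance;
# the test split is only approximately so, which is expected.
print(X_train_norm.mean(axis=0))
print(X_test_norm.mean(axis=0))
```

This mirrors what `sklearn.preprocessing.StandardScaler` does when you call `fit` on the training set and then `transform` on both sets.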
Whether to compute mu and sigma separately for the test set mostly depends on:

1. The domains of the training and test sets. If the two sets come from different domains, and the test set is closer to real-world (practical) data, then you may want to compute mu and sigma for the training set and separately compute mu and sigma for the test set, since they come from different distributions.
2. The sizes of the training and test sets. If the training set is large and covers most aspects of the data, including realistic examples, and the test set is small or close in distribution to the training set, then there is no need to compute separate statistics for the test set; it is enough to compute mu and sigma on the training set and apply them in all cases.
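As a quick sanity check for the kind of domain shift described above, one can compare per-feature statistics of the two splits. This is a hedged sketch, not a rigorous test; the data and the one-standard-deviation threshold are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical splits: the test set means are shifted relative to training.
X_train = rng.normal(loc=0.0, scale=1.0, size=(500, 2))
X_test = rng.normal(loc=1.5, scale=1.0, size=(100, 2))

mu_train = X_train.mean(axis=0)
sigma_train = X_train.std(axis=0)
mu_test = X_test.mean(axis=0)

# How many training standard deviations apart are the split means?
shift = np.abs(mu_test - mu_train) / sigma_train
print(shift)

# Arbitrary illustrative threshold: flag features whose means differ
# by more than one training standard deviation.
shifted_features = np.where(shift > 1.0)[0]
print(shifted_features)
```

If many features are flagged, the splits likely come from different domains, and a single set of training statistics may not describe the test data well.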