Can feature scaling be applied to the test set?

I had one doubt related to feature scaling.
I have a binary classification problem, for which I implemented mean normalization on the training set. The test set is untouched, i.e., it is raw.

After training the ML model, I want to generate predictions for the test set and see how the model performs. When I feed the untouched test dataset directly to the model, I get a low accuracy score (about 0.75). On the other hand, if I first apply feature scaling using the same scaler object defined in the training step on the test dataset, I achieve a better accuracy score (about 0.88).

My question is whether feature scaling needs to be applied on the test set or not? In other words, should I judge a model’s performance without scaling the test data?

Hi @cs_Chinmay
Yes, you want to apply the same feature scaling to the test data, so the model sees inputs from the same distribution it was trained on.

Please feel free to ask any questions,

After you normalize the training set, you have to apply the same normalization to the test set.

@tazet, I’m going to delete your reply, because I think it refers to a different course (not MLS). It also referred to dropout, but the topic of this thread is feature scaling.

Thanks for the response. So, does this mean that if the model goes into production, whatever new data comes, it will get normalized by the scaler object defined during the training step, right?

Yes. Because the weight values you learned presumed that the data had a specific normalization.

You can’t use those weights to make predictions unless you apply the same normalization to the new data.
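As a concrete sketch of this fit-on-train, transform-on-test pattern (assuming scikit-learn's StandardScaler, since the thread mentions a "scaler object"; the toy arrays here are made up for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical data: 2 features on very different scales
X_train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
X_test = np.array([[2.5, 150.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mu/sigma from the train set only
X_test_scaled = scaler.transform(X_test)        # reuse the same mu/sigma, never fit on test

# The training columns are now zero-mean, unit-variance;
# the test row is mapped using the *training* statistics.
```

Calling `fit` (or `fit_transform`) on the test set would leak test-set information into the preprocessing, which is exactly what the replies below warn against.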


As others have said here, I just want to add one minor point: you need to apply the normalization (feature scaling) to the test set using the exact parameters computed from the training set, and never involve the test set in any of those calculations.

For example, if your goal is to standardize the data, i.e., make it have a mean of 0 and a standard deviation of 1, what you do is \displaystyle{x = \frac{x-\mu}{\sigma}}

import numpy as np

## your train set
x_train = ...

## your test set
x_test = ...

## compute the scaling parameters from the train set only
## (axis=0 gives per-feature statistics for 2-D data)
mu = np.mean(x_train, axis=0)
sigma = np.std(x_train, axis=0)

## transform the x_train
x_train = (x_train - mu) / sigma

## transform the x_test with the same mu and sigma
x_test = (x_test - mu) / sigma

See here that we use the mu and sigma of the train set, not the test set. This is because, as we said before, the test set is data we know nothing about, and it should never be involved in tuning or improving our ML model.
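To make that point concrete, here is a tiny runnable check (toy numbers, not from the thread): after scaling with the training statistics, the train set comes out standardized, while the test set generally does not, and that is expected.

```python
import numpy as np

x_train = np.array([1.0, 2.0, 3.0, 4.0])
x_test = np.array([2.0, 5.0])

mu = np.mean(x_train)      # statistics come from the train set only
sigma = np.std(x_train)

x_train_s = (x_train - mu) / sigma
x_test_s = (x_test - mu) / sigma

print(x_train_s.mean())  # ~0: the train set is standardized
print(x_test_s.mean())   # generally nonzero, and that is fine
```

If we had instead computed a separate mu and sigma from the test set, the two sets would be mapped inconsistently and the learned weights would no longer apply.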

Hi @cs_Chinmay .
Yes, you must normalize the test data to the same range as the training set. It is necessary for correct predictions and better performance.
As mentioned above, you must use the same scaler for scaling the features of the test set that you used for the training set. This also helps you avoid data leakage during the testing phase.
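Circling back to the production question earlier in the thread: one common pattern is to persist the training-time scaling parameters (or the fitted scaler object) alongside the model, and reload them when scoring new data. A minimal sketch using the standard library's pickle (the variable names and numbers here are illustrative, not from the thread):

```python
import pickle
import numpy as np

# Toy "fitted scaler": just the statistics computed on the train set
train_stats = {"mu": 2.5, "sigma": 1.25}

# At training time: serialize the statistics next to the model weights
blob = pickle.dumps(train_stats)

# In production: load them and scale incoming data the same way
stats = pickle.loads(blob)
new_data = np.array([3.0, 5.0])
scaled = (new_data - stats["mu"]) / stats["sigma"]
print(scaled)  # [0.4 2. ]
```

The key property is that production data is transformed with the exact numbers learned during training, never with statistics recomputed on the new data.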