In the third week of the second course in the MLS, in the first optional lab on model evaluation, the mean and standard deviation of the training set are computed with StandardScaler().fit_transform, whereas the same operation on the validation set uses StandardScaler().transform. I don't understand why different functions are used here. Can someone please help me out?
This difference, using .fit_transform for the training set and .transform for the validation set, is a crucial aspect of data preprocessing in machine learning.
Explanation of .fit_transform vs. .transform
- StandardScaler().fit_transform(training_data):
  - The .fit_transform() function calculates the mean and standard deviation of the training data and then scales it based on these values.
  - This ensures that the model learns the scaling parameters only from the training set.
- StandardScaler().transform(validation_data):
  - When applying scaling to the validation data (or any new data), you should use .transform() only.
  - Using .transform() applies the same scaling parameters (mean and standard deviation) from the training set to the validation set, ensuring the model's performance is evaluated on data scaled consistently with training (see the sketch after this list).
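For reference, here is a minimal sketch of the pattern the lab follows (the array names x_train and x_cv are placeholders, not necessarily the lab's exact variables). Note that the same fitted scaler instance is reused for the validation data; calling .transform() on a brand-new StandardScaler would raise a NotFittedError because it has no learned statistics yet.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-ins for the training and validation feature arrays
x_train = np.array([[1.0], [2.0], [3.0], [4.0]])
x_cv = np.array([[1.5], [3.5]])

scaler = StandardScaler()

# fit_transform: learn the mean/std from the training data, then scale it
x_train_scaled = scaler.fit_transform(x_train)

# transform: reuse the training mean/std to scale the validation data
x_cv_scaled = scaler.transform(x_cv)

print(scaler.mean_, scaler.scale_)  # statistics learned from x_train only
```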
Why This Matters
If we used .fit_transform() on the validation set, it would calculate new scaling parameters based on the validation data, introducing data leakage. This could cause inconsistencies in model evaluation, since the validation set's mean and standard deviation would differ from the training set's, impacting model accuracy and generalizability.
@wai_yar_aung111 thank you very much for the clarification. I completely missed this crucial step to use the fitted parameters from the training set on the validation set too.
Short answer:
We don’t train (i.e. fit) on the validation set.