In the optional lab “Model Evaluation and Selection” we performed feature scaling after splitting the dataset into train, cv and test subsets. There it is explained that the mean and standard deviation of the train data must be used to scale the cv and test subsets. Wouldn’t it be easier to scale all the data before splitting it? Or is there a reason why we perform scaling after splitting the data?
Hi @oldwardrober,
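Scaling the data after splitting it into training, CV, and test sets ensures that the mean and standard deviation are derived solely from the training data, which prevents data leakage. If you scaled everything before splitting, the statistics would include information from the CV and test examples, so your evaluation on those sets would look slightly better than it should. Scaling after the split simulates a real-world scenario where future data is unseen.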
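Here is a minimal sketch of the idea, using only the Python standard library. The data, the 60/20/20 split, and the `scale` helper are made up for illustration and are not taken from the lab. The point is just that `mu` and `sigma` come from the training subset and are then reused for the other two.

```python
import random
import statistics

random.seed(0)

# Hypothetical 1-D feature: 100 samples drawn from a normal distribution.
data = [random.gauss(50.0, 10.0) for _ in range(100)]

# Split first: 60% train, 20% cv, 20% test (illustrative proportions).
x_train, x_cv, x_test = data[:60], data[60:80], data[80:]

# Compute the scaling statistics from the training subset only.
mu = statistics.fmean(x_train)
sigma = statistics.pstdev(x_train)  # population standard deviation


def scale(xs, mu, sigma):
    """Z-score scale xs using the supplied mean and standard deviation."""
    return [(x - mu) / sigma for x in xs]


# Reuse the training statistics for every subset.
x_train_scaled = scale(x_train, mu, sigma)
x_cv_scaled = scale(x_cv, mu, sigma)
x_test_scaled = scale(x_test, mu, sigma)

print(f"train mean after scaling: {statistics.fmean(x_train_scaled):.3f}")
print(f"cv mean after scaling:    {statistics.fmean(x_cv_scaled):.3f}")
```

The scaled training set ends up with mean 0 and standard deviation 1, while the CV and test sets will usually be close to that but not exact, which is expected. With scikit-learn the same pattern is to call `fit` (or `fit_transform`) on the training set and only `transform` on the CV and test sets.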
Hope it helps!