Cross-validation set appears to undergo independent (from training set) scaling in optional lab

In the lectures, we are told that the features of the cross-validation data set should be scaled using the same mean and standard deviation computed from the training set, so that the predictions (yhat) are accurate.
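For concreteness, here is a minimal, self-contained sketch of that rule using plain z-score scaling; the data and variable names (mu, sigma, etc.) are made up for illustration:

import numpy as np

# Made-up feature arrays, for illustration only
x_train = np.array([[1.0], [2.0], [3.0], [4.0]])
x_cv = np.array([[1.5], [3.5]])

# Compute the scaling statistics from the training set only
mu = x_train.mean(axis=0)
sigma = x_train.std(axis=0)

# Apply those same statistics to both sets (z-score scaling)
x_train_scaled = (x_train - mu) / sigma
x_cv_scaled = (x_cv - mu) / sigma  # no new mean/std computed from x_cv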

When reading through the code in the optional lab, it appears to me that the cross-validation data is scaled independently. Code below:

from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Initialize lists to save the errors, models, and feature transforms
train_mses = []
cv_mses = []
models = []
polys = []
scalers = []

# Loop over polynomial degrees from 1 to 10, adding one more degree each iteration.
for degree in range(1,11):
    
    # Add polynomial features to the training set
    poly = PolynomialFeatures(degree, include_bias=False)
    X_train_mapped = poly.fit_transform(x_train)
    polys.append(poly)
    
    # Scale the training set
    scaler_poly = StandardScaler()
    X_train_mapped_scaled = scaler_poly.fit_transform(X_train_mapped)
    scalers.append(scaler_poly)
    
    # Create and train the model
    model = LinearRegression()
    model.fit(X_train_mapped_scaled, y_train)
    models.append(model)
    
    # Compute the training MSE
    yhat = model.predict(X_train_mapped_scaled)
    train_mse = mean_squared_error(y_train, yhat) / 2
    train_mses.append(train_mse)
    
    # Add polynomial features and scale the cross validation set
    X_cv_mapped = poly.transform(x_cv)
    X_cv_mapped_scaled = scaler_poly.transform(X_cv_mapped)
    
    # Compute the cross validation MSE
    yhat = model.predict(X_cv_mapped_scaled)
    cv_mse = mean_squared_error(y_cv, yhat) / 2
    cv_mses.append(cv_mse)
    
# Plot the results
degrees = range(1, 11)
utils.plot_train_cv_mses(degrees, train_mses, cv_mses, title="degree of polynomial vs. train and CV MSEs")

This is in the second-to-last section of the lab. I am probably misinterpreting this but cannot figure out how.

It may seem like the cross-validation data is scaled independently because the mapping and scaling steps are repeated for the cross-validation set. However, the key detail is that the same poly and scaler_poly objects, already fitted to the training data, are reused: scaler_poly is never re-fitted to the cross-validation data; it simply applies the transformation based on the mean and standard deviation computed from the training set.

If you were to scale the cross-validation data independently (i.e., by fitting a new StandardScaler to X_cv_mapped), the cross-validation features would be transformed with statistics the model never saw during training, and you would be using information from the cross-validation set that should only serve for evaluation. This could lead to misleading, often overly optimistic, performance metrics.
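To make the difference concrete, here is a small, self-contained sketch with made-up data; only the first variant mirrors what the lab code above actually does:

import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up mapped features, for illustration only
X_train_mapped = np.array([[1.0, 1.0], [2.0, 4.0], [3.0, 9.0]])
X_cv_mapped = np.array([[1.5, 2.25], [2.5, 6.25]])

# What the lab does: fit the scaler on the training set, then only transform the CV set
scaler_poly = StandardScaler()
X_train_mapped_scaled = scaler_poly.fit_transform(X_train_mapped)
X_cv_mapped_scaled = scaler_poly.transform(X_cv_mapped)  # uses the training mean/std

# Independent scaling (NOT what the lab does): a new scaler fitted on X_cv_mapped
# computes a separate mean and standard deviation from the cross-validation set itself
X_cv_scaled_independent = StandardScaler().fit_transform(X_cv_mapped)

print(X_cv_mapped_scaled)       # scaled with the training statistics
print(X_cv_scaled_independent)  # scaled with its own statistics; the values differ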
