Cross-validation set appears to undergo independent (from training set) scaling in optional lab

In the lectures, we are told that the features of the cross-validation data set should be scaled using the same mean and standard deviation computed from the training set, so that the predictions (yhat) are accurate.
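For concreteness, here is a minimal, self-contained sketch of that rule using plain z-score scaling; the data and variable names (mu, sigma, etc.) are made up for illustration:

import numpy as np

# Made-up feature arrays, for illustration only
x_train = np.array([[1.0], [2.0], [3.0], [4.0]])
x_cv = np.array([[1.5], [3.5]])

# Compute the scaling statistics from the training set only
mu = x_train.mean(axis=0)
sigma = x_train.std(axis=0)

# Apply those same statistics to both sets (z-score scaling)
x_train_scaled = (x_train - mu) / sigma
x_cv_scaled = (x_cv - mu) / sigma  # no new mean/std computed from x_cv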

When reading through the code in the optional lab, it appears to me that the cross-validation data is scaled independently. Code below:

from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Initialize lists to save the errors, models, and feature transforms
train_mses = []
cv_mses = []
models = []
polys = []
scalers = []

# Loop over polynomial degrees from 1 to 10, adding one more degree each iteration.
for degree in range(1,11):
    
    # Add polynomial features to the training set
    poly = PolynomialFeatures(degree, include_bias=False)
    X_train_mapped = poly.fit_transform(x_train)
    polys.append(poly)
    
    # Scale the training set
    scaler_poly = StandardScaler()
    X_train_mapped_scaled = scaler_poly.fit_transform(X_train_mapped)
    scalers.append(scaler_poly)
    
    # Create and train the model
    model = LinearRegression()
    model.fit(X_train_mapped_scaled, y_train)
    models.append(model)
    
    # Compute the training MSE
    yhat = model.predict(X_train_mapped_scaled)
    train_mse = mean_squared_error(y_train, yhat) / 2
    train_mses.append(train_mse)
    
    # Add polynomial features and scale the cross validation set
    X_cv_mapped = poly.transform(x_cv)
    X_cv_mapped_scaled = scaler_poly.transform(X_cv_mapped)
    
    # Compute the cross validation MSE
    yhat = model.predict(X_cv_mapped_scaled)
    cv_mse = mean_squared_error(y_cv, yhat) / 2
    cv_mses.append(cv_mse)
    
# Plot the results
degrees = range(1, 11)
utils.plot_train_cv_mses(degrees, train_mses, cv_mses, title="degree of polynomial vs. train and CV MSEs")

This is in the second-to-last section of the lab. I am probably misinterpreting this but cannot figure out how.

It may seem like the cross-validation data is scaled independently because the mapping and scaling steps are repeated for the cross-validation set. However, the key detail is that the same poly and scaler_poly objects, already fitted to the training data, are reused: scaler_poly is never re-fitted to the cross-validation data; it simply applies the transformation based on the mean and standard deviation computed from the training set.

If you were to scale the cross-validation data independently (i.e., by fitting a new StandardScaler to X_cv_mapped), the cross-validation features would be transformed with statistics the model never saw during training, and you would be using information from the cross-validation set that should only serve for evaluation. This could lead to misleading, often overly optimistic, performance metrics.
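To make the difference concrete, here is a small, self-contained sketch with made-up data; only the first variant mirrors what the lab code above actually does:

import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up mapped features, for illustration only
X_train_mapped = np.array([[1.0, 1.0], [2.0, 4.0], [3.0, 9.0]])
X_cv_mapped = np.array([[1.5, 2.25], [2.5, 6.25]])

# What the lab does: fit the scaler on the training set, then only transform the CV set
scaler_poly = StandardScaler()
X_train_mapped_scaled = scaler_poly.fit_transform(X_train_mapped)
X_cv_mapped_scaled = scaler_poly.transform(X_cv_mapped)  # uses the training mean/std

# Independent scaling (NOT what the lab does): a new scaler fitted on X_cv_mapped
# computes a separate mean and standard deviation from the cross-validation set itself
X_cv_scaled_independent = StandardScaler().fit_transform(X_cv_mapped)

print(X_cv_mapped_scaled)       # scaled with the training statistics
print(X_cv_scaled_independent)  # scaled with its own statistics; the values differ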
