I tried a little experiment of my own to see the effect of feature scaling on the convergence of gradient descent, like what was shown in the lecture (I didn't expect it to have any effect on the learned parameters):
import numpy as np
rng = np.random.default_rng(42)
x1 = np.linspace(0, 5, 1000)
x2 = np.linspace(300, 2000, 1000)
X1, X2 = np.meshgrid(x1, x2)
w0, w1, w2 = rng.random(3)
y = w0 + w1 * X1 + w2 * x2 + rng.normal(size=X1.shape)
num_samples = 1000
indices = rng.choice(X1.size, num_samples, replace=False)
X_samples = np.c_[X1.ravel()[indices], X2.ravel()[indices]]
y_sample = y.ravel()[indices]
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_samples, y_sample, test_size=0.2, random_state=42)
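For context, the whole point is that the two features live on very different scales (x1 in [0, 5], x2 in [300, 2000]); a quick way to confirm that on the sampled design matrix is something like this:
# Sanity check: the two columns of the design matrix have very different ranges and spreads.
print(X_samples.min(axis=0), X_samples.max(axis=0))
print(X_samples.mean(axis=0), X_samples.std(axis=0))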
The real parameters w0, w1, w2 are the following:
[0.773956048, 0.438878439, 0.858597919911]
I first tried training my own regression model with gradient descent:
def cost(X, y, w0, w1, w2):
    # Halved mean squared error of the linear model w0 + w1*x1 + w2*x2.
    y_hat = w0 + w1 * X[:, 0] + w2 * X[:, 1]
    return np.mean((y_hat - y)**2) / 2

def gradient(X, y, w0, w1, w2):
    # Partial derivatives of the cost with respect to w0, w1, w2.
    y_hat = w0 + w1 * X[:, 0] + w2 * X[:, 1]
    diff = y_hat - y
    dw0 = diff
    dw1 = diff * X[:, 0]
    dw2 = diff * X[:, 1]
    return np.mean([dw0, dw1, dw2], axis=1)

def train(X, y, w0, w1, w2, learning_rate=0.01, iter=1000, mod=10):
    # Batch gradient descent; logs the cost and parameters every `mod` iterations.
    J = []
    for i in range(1, iter + 1):
        J.append(cost(X, y, w0, w1, w2))
        dw0, dw1, dw2 = gradient(X, y, w0, w1, w2)
        w0 = w0 - learning_rate * dw0
        w1 = w1 - learning_rate * dw1
        w2 = w2 - learning_rate * dw2
        if i == 1 or i % mod == 0:
            print(f"epoch-{i}, J={J[-1]}, params={[w0, w1, w2]}")
    return (J, [w0, w1, w2])
But I had to set the learning rate alpha very small (1e-6), otherwise the cost would blow up to infinity early on, and the number of iterations needed to converge was huge: somewhere between a million epochs (where the cost was still high, around 2700-3000) and 50 million epochs. Fair enough, this matches the lecture.
J, (w00, w11, w22) = train(X_train, y_train, 0, 0, 0, learning_rate=1.0e-6, iter=50000000, mod=1000000)
But the parameters I ended up with were very different from the ones I used to generate the data:
w00, w11, w22 = np.float64(258.4471608900622), np.float64(292.34820153895186), np.float64(-6.901173958145274e-05)
Is this normal, and how is it possible? (The cost was low, by the way: around 0.5065489.)
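For comparison, this is the kind of scaled run I had in mind for my own implementation (a sketch reusing the same train function; the learning rate of 0.1 is just a guess, not something I tuned):
# Standardize the features by hand, then rerun the same batch gradient descent.
mu = X_train.mean(axis=0)
sigma = X_train.std(axis=0)
X_train_std = (X_train - mu) / sigma
# With both features on a comparable scale, a much larger learning rate should be usable.
J_std, (s0, s1, s2) = train(X_train_std, y_train, 0, 0, 0, learning_rate=0.1, iter=10_000, mod=1000)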
Then I tried scikit-learn's LinearRegression (which, I believe, solves the least-squares problem directly via scipy instead of using gradient descent) to see if I would get similar results:
On raw data
from sklearn.linear_model import LinearRegression
reg = LinearRegression()
reg.fit(X_train, y_train)
reg.intercept_, reg.coef_
(np.float64(258.4471684403575), array([ 2.92348201e+02, -6.90154603e-05]))
Standardized data
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
reg2 = LinearRegression()
reg2.fit(StandardScaler().fit_transform(X_train), y_train)
reg2.intercept_, reg2.coef_
(np.float64(977.8397583139017), array([ 4.27239465e+02, -3.47854037e-02]))
Min-max scaled data
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import MinMaxScaler
reg3 = LinearRegression()
reg3.fit(MinMaxScaler().fit_transform(X_train), y_train)
reg3.intercept_, reg3.coef_
(np.float64(258.4262289148239), array([ 1.46174100e+03, -1.17091395e-01]))
For all of the above models, I get the same mean squared error: np.float64(1.0130978057750846), which is pretty good given that my target variable ranges from around 250 to 1700.
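For what it's worth, this is roughly how I computed those errors (a sketch, shown here on the test split; the scaled models need the test features transformed with scalers fit on the training data):
from sklearn.metrics import mean_squared_error
scaler = StandardScaler().fit(X_train)
minmaxer = MinMaxScaler().fit(X_train)
# Each model gets the test features in the representation it was trained on.
print(mean_squared_error(y_test, reg.predict(X_test)))
print(mean_squared_error(y_test, reg2.predict(scaler.transform(X_test))))
print(mean_squared_error(y_test, reg3.predict(minmaxer.transform(X_test))))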
I also tried scikit-learn's SGDRegressor on the raw data, with a very large number of iterations (close to the maximum possible) and re-fitting in a loop (although I don't know whether fit continues from the current learned parameters or restarts the learning), and it did not converge at all. It did converge after scaling the data, which is what their documentation suggests.
SGDRegressor
from sklearn.linear_model import SGDRegressor
reg = SGDRegressor(loss="squared_error", penalty=None, max_iter=4_000_000_000)
for _ in range(1, 10_000 + 1):
    reg.fit(X_train, y_train)
print(f"{reg.coef_}, {reg.intercept_}")
[ 2.13163675e+11 -1.41046022e+11], [-1.66476243e+10]
from sklearn.metrics import mean_squared_error
mean_squared_error(y_train, reg.predict(X_train))
np.float64(3.0121150102982553e+28)
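The scaled version that did converge looked roughly like this (a sketch from memory; I scaled with StandardScaler, and the pipeline wrapper and the hyperparameters here are just my guess at what I ran):
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
# Scale the features, then fit SGD on the scaled representation.
sgd_scaled = make_pipeline(StandardScaler(), SGDRegressor(loss="squared_error", penalty=None, max_iter=10_000))
sgd_scaled.fit(X_train, y_train)
print(sgd_scaled[-1].coef_, sgd_scaled[-1].intercept_)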
Do you have any comments or explanations, please? How come the coefficients I obtained are so far off from the original ones? I know that in machine learning we are more interested in getting a similar mapping between inputs and outputs than in recovering the true parameters used to generate the data, and I guess it makes sense for the parameters to be very different in the versions where I scaled the data. But even the parameters obtained by fitting on the raw data are very far off, and yet the error is very small. And when I plot the hyperplanes generated by the models, they look almost identical; it looks like they are all on top of each other.
import matplotlib.pyplot as plt
%matplotlib ipympl
scaler = StandardScaler().fit(X_train)
minmaxer = MinMaxScaler().fit(X_train)
new_X = np.hstack((X1.ravel().reshape(-1, 1), X2.ravel().reshape(-1, 1)))
X_scaled = scaler.transform(new_X)
X_minmax = minmaxer.transform(new_X)
fig = plt.figure(figsize=(5, 5))
ax = fig.add_subplot(111, projection="3d")
ax.plot_surface(X1, X2, y, label="True")
ax.plot_surface(X1, X2, reg.intercept_ + reg.coef_[0] * X1 + reg.coef_[1] * X2, label="Lin Reg")
ax.plot_surface(
    X1, X2,
    reg2.intercept_ + reg2.coef_[0] * X_scaled[:, 0].reshape(X1.shape) + reg2.coef_[1] * X_scaled[:, 1].reshape(X2.shape),
    label="Lin Reg std"
)
ax.plot_surface(
    X1, X2,
    reg3.intercept_ + reg3.coef_[0] * X_minmax[:, 0].reshape(X1.shape) + reg3.coef_[1] * X_minmax[:, 1].reshape(X1.shape),
    label="Lin Reg min-max"
)
ax.set_xlabel("X1")
ax.set_ylabel("X2")
ax.set_zlabel("Z")
plt.legend()
plt.show()
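The same thing shows up if I compare the predictions directly instead of eyeballing the surfaces; a quick check along these lines (reusing scaler and minmaxer from above, and with reg being the LinearRegression fit on the raw data, refit if it was overwritten by the SGD experiment):
pred_raw = reg.predict(X_test)
pred_std = reg2.predict(scaler.transform(X_test))
pred_minmax = reg3.predict(minmaxer.transform(X_test))
# If the three models really describe the same hyperplane, the predictions should
# agree up to numerical precision.
print(np.allclose(pred_raw, pred_std), np.allclose(pred_raw, pred_minmax))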
I really feel like saying something stupid, like "a hyperplane can be defined by more than one equation", because there is an infinite number of normal vectors to a hyperplane: they are all scaled versions of the normalized normal vector, right? But I guess that would only be the case when the equation of the hyperplane has the form w.T * X = 0, which is not the case here?
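If that intuition is on the right track, then mapping the standardized model's coefficients back into the original feature units should give something very close to the raw-data model; this is the kind of check I mean (a sketch, reusing scaler from above):
# y_hat = b' + coef' @ z with z = (x - mean) / scale is the same affine function as
# y_hat = (b' - coef' @ (mean / scale)) + (coef' / scale) @ x.
coef_back = reg2.coef_ / scaler.scale_
intercept_back = reg2.intercept_ - np.sum(reg2.coef_ * scaler.mean_ / scaler.scale_)
print(coef_back, intercept_back)  # should be close to reg.coef_ and reg.intercept_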
Also, the effect of feature scaling would still depend on the cost function we are trying to optimize, right? In this case it matters because the gradient with respect to each parameter is multiplied by the corresponding feature (written out below). But maybe for a different cost function that would not be the case, and then the scale of the features would not matter? Although I presume that is very rare, or maybe impossible?
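To be concrete, the gradient expression I am referring to (the one the gradient function above computes) is

$$\frac{\partial J}{\partial w_j} = \frac{1}{m}\sum_{i=1}^{m}\left(\hat{y}^{(i)} - y^{(i)}\right) x_j^{(i)},$$

so the update of each parameter w_j is scaled directly by the magnitude of its feature x_j.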