I tried a little experiment of my own to see the effect of feature scaling on the convergence of gradient descent, like what was shown in the lecture (I didn't expect it to have any effect on the learned parameters):
import numpy as np
rng = np.random.default_rng(42)
x1 = np.linspace(0, 5, 1000)
x2 = np.linspace(300, 2000, 1000)
X1, X2 = np.meshgrid(x1, x2)
w0, w1, w2 = rng.random(3)
y = w0 + w1 * X1 + w2 * x2 + rng.normal(size=X1.shape)
num_samples = 1000
indices = rng.choice(X1.size, num_samples, replace=False)
X_samples = np.c_[X1.ravel()[indices], X2.ravel()[indices]]
y_sample = y.ravel()[indices]
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_samples, y_sample, test_size=0.2, random_state=42)
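For context, the whole point is that the two features live on very different scales (x1 in [0, 5], x2 in [300, 2000]); a quick way to confirm that on the sampled design matrix is something like this:
# Sanity check: the two columns of the design matrix have very different ranges and spreads.
print(X_samples.min(axis=0), X_samples.max(axis=0))
print(X_samples.mean(axis=0), X_samples.std(axis=0))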
The real parameters w0, w1, w2 are the following:
[0.773956048, 0.438878439, 0.858597919911]
I first tried training my own regression model with gradient descent:
def cost(X, y, w0, w1, w2):
    # Halved mean squared error of the linear model w0 + w1*x1 + w2*x2.
    y_hat = w0 + w1 * X[:, 0] + w2 * X[:, 1]
    return np.mean((y_hat - y)**2) / 2

def gradient(X, y, w0, w1, w2):
    # Partial derivatives of the cost with respect to w0, w1, w2.
    y_hat = w0 + w1 * X[:, 0] + w2 * X[:, 1]
    diff = y_hat - y
    dw0 = diff
    dw1 = diff * X[:, 0]
    dw2 = diff * X[:, 1]
    return np.mean([dw0, dw1, dw2], axis=1)

def train(X, y, w0, w1, w2, learning_rate=0.01, iter=1000, mod=10):
    # Batch gradient descent; logs the cost and parameters every `mod` iterations.
    J = []
    for i in range(1, iter + 1):
        J.append(cost(X, y, w0, w1, w2))
        dw0, dw1, dw2 = gradient(X, y, w0, w1, w2)
        w0 = w0 - learning_rate * dw0
        w1 = w1 - learning_rate * dw1
        w2 = w2 - learning_rate * dw2
        if i == 1 or i % mod == 0:
            print(f"epoch-{i}, J={J[-1]}, params={[w0, w1, w2]}")
    return (J, [w0, w1, w2])
But I had to set the learning rate alpha very small (1e-6), otherwise the cost would blow up to infinity early on, and the number of iterations needed to converge was huge: somewhere between a million epochs (where the cost was still high, around 2700-3000) and 50 million epochs. Fair enough, this matches the lecture.
J, (w00, w11, w22) = train(X_train, y_train, 0, 0, 0, learning_rate=1.0e-6, iter=50000000, mod=1000000)
But the parameters I ended up with were very different from the ones I used to generate the data:
w00, w11, w22 = np.float64(258.4471608900622), np.float64(292.34820153895186), np.float64(-6.901173958145274e-05)
Is this normal, and how is it possible? (The cost was low, by the way: around 0.5065489.)
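For comparison, this is the kind of scaled run I had in mind for my own implementation (a sketch reusing the same train function; the learning rate of 0.1 is just a guess, not something I tuned):
# Standardize the features by hand, then rerun the same batch gradient descent.
mu = X_train.mean(axis=0)
sigma = X_train.std(axis=0)
X_train_std = (X_train - mu) / sigma
# With both features on a comparable scale, a much larger learning rate should be usable.
J_std, (s0, s1, s2) = train(X_train_std, y_train, 0, 0, 0, learning_rate=0.1, iter=10_000, mod=1000)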
Then I tried scikit-learn's LinearRegression (which, I believe, solves the least-squares problem directly via scipy instead of using gradient descent) to see if I would get similar results:
On raw data
from sklearn.linear_model import LinearRegression
reg = LinearRegression()
reg.fit(X_train, y_train)
reg.intercept_, reg.coef_
(np.float64(258.4471684403575), array([ 2.92348201e+02, -6.90154603e-05]))
Standardized data
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
reg2 = LinearRegression()
reg2.fit(StandardScaler().fit_transform(X_train), y_train)
reg2.intercept_, reg2.coef_
(np.float64(977.8397583139017), array([ 4.27239465e+02, -3.47854037e-02]))
Min-max scaled data
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import MinMaxScaler
reg3 = LinearRegression()
reg3.fit(MinMaxScaler().fit_transform(X_train), y_train)
reg3.intercept_, reg3.coef_
(np.float64(258.4262289148239), array([ 1.46174100e+03, -1.17091395e-01]))
For all of the above models, I get the same mean squared error: np.float64(1.0130978057750846), which is pretty good given that my target variable ranges from around 250 to 1700.
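For what it's worth, this is roughly how I computed those errors (a sketch, shown here on the test split; the scaled models need the test features transformed with scalers fit on the training data):
from sklearn.metrics import mean_squared_error
scaler = StandardScaler().fit(X_train)
minmaxer = MinMaxScaler().fit(X_train)
# Each model gets the test features in the representation it was trained on.
print(mean_squared_error(y_test, reg.predict(X_test)))
print(mean_squared_error(y_test, reg2.predict(scaler.transform(X_test))))
print(mean_squared_error(y_test, reg3.predict(minmaxer.transform(X_test))))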
I also tried scikit-learn's SGDRegressor on the raw data, with a very large number of iterations (close to the maximum possible) and re-fitting in a loop (although I don't know whether fit continues from the current learned parameters or restarts the learning), and it did not converge at all. It did converge after scaling the data, which is what their documentation suggests.
SGDRegressor
from sklearn.linear_model import SGDRegressor
reg = SGDRegressor(loss="squared_error", penalty=None, max_iter=4_000_000_000)
for _ in range(1, 10_000 + 1):
    reg.fit(X_train, y_train)
print(f"{reg.coef_}, {reg.intercept_}")
[ 2.13163675e+11 -1.41046022e+11], [-1.66476243e+10]
from sklearn.metrics import mean_squared_error
mean_squared_error(y_train, reg.predict(X_train))
np.float64(3.0121150102982553e+28)
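The scaled version that did converge looked roughly like this (a sketch from memory; I scaled with StandardScaler, and the pipeline wrapper and the hyperparameters here are just my guess at what I ran):
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
# Scale the features, then fit SGD on the scaled representation.
sgd_scaled = make_pipeline(StandardScaler(), SGDRegressor(loss="squared_error", penalty=None, max_iter=10_000))
sgd_scaled.fit(X_train, y_train)
print(sgd_scaled[-1].coef_, sgd_scaled[-1].intercept_)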
Do you have any comments or explanations, please? How come the coefficients I obtained are so far off from the original ones? I know that in machine learning we are more interested in getting a similar mapping between inputs and outputs than in recovering the true parameters used to generate the data, and I guess it makes sense for the parameters to be very different in the versions where I scaled the data. But even the parameters obtained by fitting on the raw data are very far off, and yet the error is very small. And when I plot the hyperplanes generated by the models, they look almost identical; it looks like they are all on top of each other.
import matplotlib.pyplot as plt
%matplotlib ipympl
scaler = StandardScaler().fit(X_train)
minmaxer = MinMaxScaler().fit(X_train)
new_X = np.hstack((X1.ravel().reshape(-1, 1), X2.ravel().reshape(-1, 1)))
X_scaled = scaler.transform(new_X)
X_minmax = minmaxer.transform(new_X)
fig = plt.figure(figsize=(5, 5))
ax = fig.add_subplot(111, projection="3d")
ax.plot_surface(X1, X2, y, label="True")
ax.plot_surface(X1, X2, reg.intercept_ + reg.coef_[0] * X1 + reg.coef_[1] * X2, label="Lin Reg")
ax.plot_surface(
    X1, X2,
    reg2.intercept_ + reg2.coef_[0] * X_scaled[:, 0].reshape(X1.shape) + reg2.coef_[1] * X_scaled[:, 1].reshape(X2.shape),
    label="Lin Reg std"
)
ax.plot_surface(
    X1, X2,
    reg3.intercept_ + reg3.coef_[0] * X_minmax[:, 0].reshape(X1.shape) + reg3.coef_[1] * X_minmax[:, 1].reshape(X1.shape),
    label="Lin Reg min-max"
)
ax.set_xlabel("X1")
ax.set_ylabel("X2")
ax.set_zlabel("Z")
plt.legend()
plt.show()
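The same thing shows up if I compare the predictions directly instead of eyeballing the surfaces; a quick check along these lines (reusing scaler and minmaxer from above, and with reg being the LinearRegression fit on the raw data, refit if it was overwritten by the SGD experiment):
pred_raw = reg.predict(X_test)
pred_std = reg2.predict(scaler.transform(X_test))
pred_minmax = reg3.predict(minmaxer.transform(X_test))
# If the three models really describe the same hyperplane, the predictions should
# agree up to numerical precision.
print(np.allclose(pred_raw, pred_std), np.allclose(pred_raw, pred_minmax))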
I really feel like saying something stupid, like "a hyperplane can be defined by more than one equation", because there is an infinite number of normal vectors to a hyperplane: they are all scaled versions of the normalized normal vector, right? But I guess that would only be the case when the equation of the hyperplane has the form w.T * X = 0, which is not the case here?
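If that intuition is on the right track, then mapping the standardized model's coefficients back into the original feature units should give something very close to the raw-data model; this is the kind of check I mean (a sketch, reusing scaler from above):
# y_hat = b' + coef' @ z with z = (x - mean) / scale is the same affine function as
# y_hat = (b' - coef' @ (mean / scale)) + (coef' / scale) @ x.
coef_back = reg2.coef_ / scaler.scale_
intercept_back = reg2.intercept_ - np.sum(reg2.coef_ * scaler.mean_ / scaler.scale_)
print(coef_back, intercept_back)  # should be close to reg.coef_ and reg.intercept_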
Also, the effect of feature scaling would still depend on the cost function we are trying to optimize, right? In this case it matters because the gradient with respect to each parameter is multiplied by the corresponding feature (written out below). But maybe for a different cost function that would not be the case, and then the scale of the features would not matter? Although I presume that is very rare, or maybe impossible?
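To be concrete, the gradient expression I am referring to (the one the gradient function above computes) is

$$\frac{\partial J}{\partial w_j} = \frac{1}{m}\sum_{i=1}^{m}\left(\hat{y}^{(i)} - y^{(i)}\right) x_j^{(i)},$$

so the update of each parameter w_j is scaled directly by the magnitude of its feature x_j.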