You might already have this sorted out, but the question showed up in the Deep Learning digest email and is related to things I’ve been wondering about. Note that I’m just learning about this stuff, so take anything I say with a grain of salt.
My understanding is that the model’s failure to converge is caused primarily by the labels not being scaled, and to a lesser extent by the use of the Adam optimization algorithm.
The dataset’s labels represent income in dollars and range between 27840 and 63600. The model’s loss function is mean squared error, so the derivatives of the cost with respect to the weights and bias will be calculated with something similar to:
import numpy as np

def propagate(w, b, X, Y):
    m = X.shape[1]
    # Linear predictions
    Z = np.dot(w.T, X) + b
    # Mean squared error cost
    cost = (1 / (2 * m)) * np.sum((Z - Y) ** 2)
    # Gradients of the cost with respect to the weights and bias
    dw = (1 / m) * np.dot(X, (Z - Y).T)
    db = (1 / m) * np.sum(Z - Y)
    return cost, dw, db
The model’s initial predictions (Z) are going to be very small, while the values of Y are very large. For the predictions to get close to the targets, the model’s parameters have to become very large. That can happen over enough iterations, but with unscaled Y values it happens faster with plain gradient descent than with the Adam algorithm you are using.
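To see how big those first updates need to be, here’s a minimal sketch using the propagate function above. The data here is randomly generated to stand in for the real dataset (only the shapes and the label range matter for this point):

import numpy as np

# Hypothetical stand-in data: 2 standardized features, income-scale labels
X = np.random.randn(2, 100)
Y = np.random.uniform(27840, 63600, (1, 100))

# Parameters initialized to zero
w = np.zeros((2, 1))
b = 0.0

cost, dw, db = propagate(w, b, X, Y)
# With w and b at zero, Z is all zeros, so db = -mean(Y), which is on the
# order of -45000. The bias has to travel tens of thousands of units.
print(db)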
For reference, when I train the model with gradient descent (implemented with NumPy), here are the parameters after 4000 iterations of training (with 3.5% error):
{'w': array([[  60.55772565],
             [7650.66789304]]),
 'b': 39990.89498781739}
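In case it’s useful, the training loop behind those numbers was just plain batch gradient descent built on the propagate function above (a minimal sketch; learning_rate and num_iterations are whatever you pick):

def train_gradient_descent(w, b, X, Y, learning_rate, num_iterations):
    # Step the parameters directly against the raw gradients
    for i in range(num_iterations):
        cost, dw, db = propagate(w, b, X, Y)
        w = w - learning_rate * dw
        b = b - learning_rate * db
    return w, b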
Using the Adam optimizer in the TensorFlow model (tf.keras.optimizers.Adam) actually makes the situation worse. The problem is related to how Adam handles large gradients. Using the update for the weights as an example:
$$W = W - \alpha\frac{V_{dW}^{corrected}}{\sqrt{S_{dW}^{corrected}} + \epsilon}$$
In the above formula:
- $V_{dW}$ is the exponentially weighted moving average of the derivative of the loss with respect to $W$
- $S_{dW}$ is the exponentially weighted moving average of the square of the derivative of the loss with respect to $W$
In linear regression, if the values of the labels are very large, the derivatives that are calculated at the start of the training process will also be very large.
With the gradient descent algorithm, large derivatives lead to large parameter updates.
With the Adam algorithm, the ratio $V_{dW}/\sqrt{S_{dW}}$ tends to damp out the large oscillations that large derivatives can trigger. Essentially, $V_{dW}/\sqrt{S_{dW}}$ ends up close to plus or minus 1, so each parameter is updated by a value close to plus or minus the learning rate. My understanding is that this generally speeds up learning, but in this case it dramatically slows it down: the bias needs to travel roughly 40000 units, and Adam will only move it by about the learning rate per step, no matter how large the gradient is.
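You can see this with a quick NumPy sketch of the bias-corrected Adam moment updates, assuming a constant large gradient (the standard default hyperparameters are hard-coded here):

import numpy as np

alpha, beta1, beta2, eps = 0.01, 0.9, 0.999, 1e-8

dW = -45000.0  # a large constant gradient, like the initial db above
V = S = 0.0
for t in range(1, 11):
    V = beta1 * V + (1 - beta1) * dW
    S = beta2 * S + (1 - beta2) * dW ** 2
    V_corrected = V / (1 - beta1 ** t)
    S_corrected = S / (1 - beta2 ** t)
    step = alpha * V_corrected / (np.sqrt(S_corrected) + eps)
    print(f"t={t}: step = {step:.5f}")
# Every step is about -0.01 (minus the learning rate), even though the
# raw gradient is -45000.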
Here’s a comparison of the training loss with Adam and gradient descent in the TensorFlow model at different learning rates. The green line at the top represents both the Adam_lr_0.01 and Adam_lr_0.001 tests; they are too similar at this scale to appear as separate lines.
Optimizer comparison code
import pandas as pd
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt

# Set random seeds or the results will be all over the place
np.random.seed(42)
tf.random.set_seed(42)

def create_model(input_dim):
    """
    Create the linear regression model

    Args:
        input_dim: integer specifying the number of input features
    """
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(input_dim,)),
        tf.keras.layers.Dense(1, activation="linear")
    ])
    return model

def train_model(X, Y, optimizer, learning_rate, epochs=200):
    """
    Train model with given optimizer and learning rate, return history
    """
    # Normalize the input data
    normalizer = tf.keras.layers.Normalization(axis=-1)
    normalizer.adapt(X)
    X_norm = normalizer(X)

    model = create_model(X.shape[1])

    if optimizer.lower() == "adam":
        opt = tf.keras.optimizers.Adam(learning_rate=learning_rate)
    elif optimizer.lower() == "sgd":
        opt = tf.keras.optimizers.SGD(learning_rate=learning_rate)
    else:
        raise ValueError(f"Unsupported optimizer: {optimizer}")

    model.compile(optimizer=opt, loss=tf.keras.losses.MeanSquaredError())
    history = model.fit(
        X_norm, Y,
        epochs=epochs,
        verbose=0,
        callbacks=[tf.keras.callbacks.History()]
    )
    return model, history

def compare_optimizers(X, Y, configurations):
    """
    Train models with different configurations and plot the results
    """
    histories = {}
    final_losses = {}
    for config in configurations:
        optimizer = config["optimizer"]
        learning_rate = config["learning_rate"]
        name = f"{optimizer}_lr_{learning_rate}"
        model, history = train_model(X, Y, optimizer, learning_rate, epochs=200)
        histories[name] = history.history["loss"]
        final_losses[name] = history.history["loss"][-1]

    # Plot training curves
    plt.figure(figsize=(12, 6))
    for name, loss_history in histories.items():
        plt.plot(loss_history, label=f"{name} (Final Loss: {final_losses[name]:.2f})")
    plt.title("Training Loss Over Time")
    plt.xlabel("Epoch")
    plt.ylabel("Mean Squared Error (log scale)")
    plt.yscale("log")
    plt.grid(True)
    plt.tight_layout()
    plt.legend()
    plt.show()

    return histories, final_losses

# Load and prepare data
# src: https://www.kaggle.com/datasets/hussainnasirkhan/multiple-linear-regression-dataset
data = pd.read_csv("multiple_linear_regression_dataset.csv")
X = data.drop(["income"], axis=1)
X = X.to_numpy().astype(np.float64)
Y = data["income"]
Y = Y.to_numpy().astype(np.float64)

# Configurations to test
configurations = [
    {"optimizer": "adam", "learning_rate": 1e-2},
    {"optimizer": "sgd", "learning_rate": 1e-2},
    {"optimizer": "adam", "learning_rate": 1e-3},
    {"optimizer": "sgd", "learning_rate": 1e-3}
]

histories, final_losses = compare_optimizers(X, Y, configurations)

print("\nFinal Losses:")
for name, loss in final_losses.items():
    print(f"{name}: {loss:.2f}")
Without normalizing the labels, stochastic gradient descent seems to be the better optimization algorithm, but I think the correct solution is to normalize the labels. Here’s a comparison between Adam and gradient descent with the labels standardized using z-score normalization. With the normalized labels, all of the optimizers do a lot better, and given a few more iterations, the Adam_lr_0.01 and SGD_lr_0.01 runs produce about equal results.
standardize_targets and inverse_standardize functions
def standardize_targets(y):
    """Standardize target values using z-score normalization"""
    y_mean = np.mean(y)
    y_std = np.std(y)
    y_scaled = (y - y_mean) / y_std
    return y_scaled, y_mean, y_std

def inverse_standardize(y_scaled, y_mean, y_std):
    """Transform standardized predictions back to original scale"""
    return y_scaled * y_std + y_mean
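For completeness, here’s a sketch of how these could plug into the comparison script above. Since train_model fits its Normalization layer internally, the same feature normalization has to be recreated before predicting (this duplicates what train_model does; it’s just for illustration):

# Standardize the labels, then run the same comparison on the scaled targets
# (the reported losses will be in standardized units rather than dollars)
Y_scaled, y_mean, y_std = standardize_targets(Y)
histories, final_losses = compare_optimizers(X, Y_scaled, configurations)

# To get predictions back in dollars from a single trained model:
model, history = train_model(X, Y_scaled, "adam", 1e-2)
normalizer = tf.keras.layers.Normalization(axis=-1)
normalizer.adapt(X)
preds_scaled = model.predict(normalizer(X))
preds_dollars = inverse_standardize(preds_scaled.flatten(), y_mean, y_std)
print(preds_dollars[:5])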