Linear regression with neural network (need help and advice)

Hello all. I have been trying to code linear regression using a neural network (NN). I defined an NN with one layer and one neuron with a linear activation function.

  • Here is the link to the data I am using:

Multiple Linear Regression Dataset | Kaggle.

  • Here is the code of my model:
#######################################################
# neural network (1 neuron) to compute linear regression 
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(X.shape[1],)),    # define input shape
    tf.keras.layers.Dense(1, activation='linear')  #one layer with linear activation (linear regression model)
])

# Create the normalization layer
normalizer = tf.keras.layers.Normalization(axis=-1)

# Fit the normalizer to your data
normalizer.adapt(X)

# Apply normalization to your data
X_norm = normalizer(X)

model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
              loss=tf.keras.losses.MeanSquaredError())
model.fit(X_norm, Y, epochs=200, verbose=1)
########################################################

Where Y (target) is the income. Also X.shape=(20,2). My problem is that my losses start very high (on the order of 1e9), and the algorithm fails to converge quickly: even after 200 epochs, the loss is still decreasing roughly linearly with the epochs. I’ve tried more epochs, but that doesn’t solve the problem either. I thought it should converge quickly; needing 200 epochs for a problem this simple sounds absurd to me. What am I doing wrong? Can someone help, please?

I’m going to do two things here:

  1. I’ll edit your post so the code is wrapped in the “preformatted text” tag. This will cause it to be shown as program code instead of Markdown.

  2. I’m going to move it to the AI Discussions forum, instead of a course assignment forum area. That will help it attract a wider audience, and also make it clear that it isn’t code for a graded assignment (which would not be allowed by the Code of Conduct).

2 Likes

You might already have this sorted out, but the question showed up in the Deep Learning digest email and is related to things I’ve been wondering about. Note that I’m just learning about this stuff, so take anything I say with a grain of salt.

My understanding is that the issue with the model not converging is primarily related to the values of the labels not being scaled, and to a lesser extent to the use of the Adam optimizer.

The dataset’s labels represent Income in dollars. The labels range between 27840 and 63600. The model’s loss function is calculating mean squared error. The derivatives of the cost function with respect to the weights and bias are going to be calculated with something similar to:

import numpy as np

def propagate(w, b, X, Y):
    m = X.shape[1]
    # forward pass: linear predictions
    Z = np.dot(w.T, X) + b

    # mean squared error cost
    cost = (1 / (2 * m)) * np.sum((Z - Y) ** 2)

    # gradients of the cost with respect to w and b
    dw = (1 / m) * np.dot(X, (Z - Y).T)
    db = (1 / m) * np.sum(Z - Y)

    return dw, db, cost

The model’s initial predictions (Z) are going to be very small, and the values of Y are very large. For the predictions to get close to the target, the model’s parameters are going to have to become very big. That can be accomplished over enough iterations, but with unscaled Y values, it will happen faster with a gradient descent optimizing algorithm than with the Adam algorithm that you are using.
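To make that concrete, here’s a rough illustration (the incomes below are made up, but on the dataset’s scale): with near-zero initial parameters the predictions are close to zero, so the initial bias gradient is roughly minus the mean income, i.e. in the tens of thousands.

import numpy as np

# Illustration only: made-up incomes on the dataset's 27840-63600 scale
Y = np.array([30000.0, 45000.0, 60000.0])
Z = np.zeros_like(Y)          # predictions from near-zero initial parameters
db = np.mean(Z - Y)           # initial bias gradient
print(db)                     # -45000.0 -- a huge gradient right from the start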

For reference, when I train the model with gradient descent (implemented with NumPy), here are the parameters after 4000 iterations of training (with 3.5% error):

{'w': array([[  60.55772565],
             [7650.66789304]]),
 'b': 39990.89498781739}
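In case it’s useful, this is roughly the kind of update loop I mean; it reuses the propagate function above, and the learning rate shown is just a placeholder rather than the exact value I used:

def gradient_descent(w, b, X, Y, learning_rate=0.01, num_iterations=4000):
    # repeatedly compute gradients and step in the opposite direction
    for i in range(num_iterations):
        dw, db, cost = propagate(w, b, X, Y)
        w = w - learning_rate * dw
        b = b - learning_rate * db
    return w, b, cost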

Using the Adam optimizer in the TensorFlow model (tf.keras.optimizers.Adam) is actually making the situation worse. The problem is related to how Adam handles large gradients. Using the derivative of loss with respect to the weights as an example:

W = W - \alpha\frac{V_{dW}^{corrected}}{\sqrt{S_{dW}^{corrected}} + \epsilon}

In the above formula:

  • V_dW is the exponentially weighted moving average of the derivative of loss with respect to W
  • S_dW is the exponentially weighted moving average of the square of the derivative of loss with respect to W

In linear regression, if the values of the labels are very large, the derivatives that are calculated at the start of the training process will also be very large.

With the gradient descent algorithm, large derivatives lead to large parameter updates.

With the Adam algorithm, the ratio V_{dW}/\sqrt{S_{dW}} tends to damp out large oscillations that can be triggered by large derivatives. Essentially, V_{dW}/\sqrt{S_{dW}} can end up close to plus or minus 1, so parameters get updated by a value close to plus or minus the learning rate. My understanding is that this generally speeds up the learning process, but in this case, with large Y values, it seems to slow it down significantly.
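Here’s a small NumPy sketch of that effect (it mimics the Adam update formula above, not TensorFlow’s internals): with a large, roughly constant gradient, the bias-corrected ratio settles near plus or minus 1, so each update moves the parameter by about the learning rate regardless of how big the gradient is.

import numpy as np

beta1, beta2, eps = 0.9, 0.999, 1e-8   # Adam's usual defaults
v, s = 0.0, 0.0
dW = -45000.0                          # large, roughly constant gradient (unscaled incomes)

for t in range(1, 201):
    v = beta1 * v + (1 - beta1) * dW          # moving average of the gradient
    s = beta2 * s + (1 - beta2) * dW ** 2     # moving average of the squared gradient
    v_corrected = v / (1 - beta1 ** t)
    s_corrected = s / (1 - beta2 ** t)
    ratio = v_corrected / (np.sqrt(s_corrected) + eps)

print(ratio)   # ~ -1.0, so each step is ~ the learning rate, no matter how big dW is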

Here’s a comparison of the training-loss curves with Adam and gradient descent in the TensorFlow model. The green slope at the top is both the Adam_lr_0.01 and Adam_lr_0.001 tests; they are too similar at that scale to appear as separate lines.

Optimizer comparison code
import pandas as pd
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt


# Set random seeds or the results will be all over the place
np.random.seed(42)
tf.random.set_seed(42)


def create_model(input_dim):
    """
    Create the linear regression model

    Args:
    input_dim: integer specifying the number of input features
    """
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(input_dim,)),
        tf.keras.layers.Dense(1, activation="linear")
    ])
    return model


def train_model(X, Y, optimizer, learning_rate, epochs=200):
    """
    Train model with given optimizer and learning rate, return history
    """
    # Normalize the input data
    normalizer = tf.keras.layers.Normalization(axis=-1)
    normalizer.adapt(X)
    X_norm = normalizer(X)

    model = create_model(X.shape[1])

    if optimizer.lower() == "adam":
        opt = tf.keras.optimizers.Adam(learning_rate=learning_rate)
    elif optimizer.lower() == "sgd":
        opt = tf.keras.optimizers.SGD(learning_rate=learning_rate)
    else:
        raise ValueError(f"Unsupported optimizer: {optimizer}")

    model.compile(optimizer=opt, loss=tf.keras.losses.MeanSquaredError())

    history = model.fit(
        X_norm, Y,
        epochs=epochs,
        verbose=0,
        callbacks=[tf.keras.callbacks.History()]
    )

    return model, history


def compare_optimizers(X, Y, configurations):
    """
    Train models with different configurations and plot the results
    """
    histories = {}
    final_losses = {}

    for config in configurations:
        optimizer = config["optimizer"]
        learning_rate = config["learning_rate"]
        name = f"{optimizer}_lr_{learning_rate}"

        model, history = train_model(X, Y, optimizer, learning_rate, epochs=200)
        histories[name] = history.history["loss"]
        final_losses[name] = history.history["loss"][-1]

    # Plot training curves
    plt.figure(figsize=(12, 6))

    for name, loss_history in histories.items():
        plt.plot(loss_history, label=f"{name} (Final Loss: {final_losses[name]:.2f})")

    plt.title("Training Loss Over Time")
    plt.xlabel("Epoch")
    plt.ylabel("Mean Squared Error (log scale)")
    plt.yscale("log")
    plt.grid(True)
    plt.tight_layout()
    plt.legend()
    plt.show()

    return histories, final_losses


# Load and prepare data
# src: `https://www.kaggle.com/datasets/hussainnasirkhan/multiple-linear-regression-dataset`
data = pd.read_csv("multiple_linear_regression_dataset.csv")
X = data.drop(["income"], axis=1)
X = X.to_numpy().astype(np.float64)
Y = data["income"]
Y = Y.to_numpy().astype(np.float64)


# Configurations to test
configurations = [
    {"optimizer": "adam", "learning_rate": 1e-2},
    {"optimizer": "sgd", "learning_rate": 1e-2},
    {"optimizer": "adam", "learning_rate": 1e-3},
    {"optimizer": "sgd", "learning_rate": 1e-3}
]

histories, final_losses = compare_optimizers(X, Y, configurations)

print("\nFinal Losses:")
for name, loss in final_losses.items():
    print(f"{name}: {loss:.2f}")

Without normalizing the labels, it seems that stochastic gradient descent is a better optimizing algorithm, but I think the correct solution is to normalize the labels. Here’s a comparison between Adam and gradient descent with the labels standardized using z-score normalization. With the normalized labels, all optimizers do a lot better. Given a few more iterations, the Adam_lr_0.01 and SGD_lr_0.01 optimizers produce about equal results.

standardize_targets and inverse_standardize functions
def standardize_targets(y):
    """Standardize target values using z-score normalization"""
    y_mean = np.mean(y)
    y_std = np.std(y)
    y_scaled = (y - y_mean) / y_std
    return y_scaled, y_mean, y_std


def inverse_standardize(y_scaled, y_mean, y_std):
    """Transform standardized predictions back to original scale"""
    return y_scaled * y_std + y_mean
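And a sketch of how these might plug into the comparison code above (the re-created normalizer is only there to get predictions on the same input scale the model was trained with; in real use you’d keep the one fitted during training):

# Standardize the labels, train as before, then map predictions back to dollars
Y_scaled, y_mean, y_std = standardize_targets(Y)
model, history = train_model(X, Y_scaled, "adam", learning_rate=1e-2, epochs=200)

normalizer = tf.keras.layers.Normalization(axis=-1)
normalizer.adapt(X)                              # same stats as inside train_model
predictions_scaled = model.predict(normalizer(X))
predictions = inverse_standardize(predictions_scaled, y_mean, y_std)
print(predictions[:3])                           # back on the original income scale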

2 Likes

Hello, Scossar. No, I hadn’t figured out the solution until you explained it to me. Thank you for the detailed answer. From now on, I will keep in mind whether I should normalize the targets, and I will choose the optimizer more carefully too.

1 Like

Generally, you don’t need to normalize the labels of the training examples; reducing the learning rate (by a lot) can mitigate this. This specific data set may be one where normalizing the labels allows you to use a larger learning rate.

2 Likes