Stochastic Gradient Descent convergence

Why won’t SGD ever converge?

Where does it say SGD never converges? Do you have a reference? I think it’s the same answer as with everything here: it depends, both on your data and on your choice of the relevant hyperparameters.

Can you please give us an example? When would Stochastic Gradient Descent converge, and when would it not?

Here is an example.

I used the “Boston Housing price regression dataset” that ships with Keras.

# imports needed for the snippets below
import numpy as np
import tensorflow as tf

# load the Boston Housing dataset bundled with Keras
boston_housing = tf.keras.datasets.boston_housing
(train_data, train_labels), (test_data, test_labels) = boston_housing.load_data()

# shuffle the training set
order = np.random.permutation(len(train_labels))
train_data = train_data[order]
train_labels = train_labels[order]

# normalize the data using training-set statistics
mean = train_data.mean(axis=0)
std = train_data.std(axis=0)
train_data = (train_data - mean) / std
test_data = (test_data - mean) / std

Then I created a small model consisting of 3 Dense layers to predict the housing price, and used SGD for optimization.
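The original post does not show the model code, but a minimal sketch might look like this (the layer widths, activation, learning rate, and loss function are my assumptions, not the exact values used above):

# Sketch of a small 3-Dense-layer regression model trained with SGD.
# Layer sizes, learning rate, and loss are assumed values, not the
# exact ones from the original post.
def build_model(learning_rate=0.01):
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation='relu',
                              input_shape=(train_data.shape[1],)),
        tf.keras.layers.Dense(64, activation='relu'),
        tf.keras.layers.Dense(1),  # single output: the predicted price
    ])
    model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=learning_rate),
                  loss='mse', metrics=['mae'])
    return model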

Here is the result: I ran it 3 times, re-shuffling the same data before each run. Even though I was using the same data, one trial did not converge.
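For reference, the repeated trials can be reproduced with a loop like this (a sketch; the epoch count and validation split are assumptions):

# Run three trials on re-shuffled copies of the same training data and
# compare the final validation losses across runs.
for trial in range(3):
    order = np.random.permutation(len(train_labels))
    model = build_model()
    history = model.fit(train_data[order], train_labels[order],
                        epochs=100, validation_split=0.2, verbose=0)
    print(f"Trial {trial}: final val loss = {history.history['val_loss'][-1]:.2f}")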

Since it is “stochastic”, it is difficult to say “deterministically” whether it will converge or not. So, as Paul said, what we can say is… “it depends”.
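To make the hyperparameter dependence concrete: the same model can converge or diverge depending only on the learning rate. A sketch (the rates 0.01 and 1.0 are arbitrary illustrative values):

# With a learning rate that is too large, the MSE loss on this dataset
# typically blows up (often to NaN); with a small one, it decreases.
for lr in (0.01, 1.0):
    model = build_model(learning_rate=lr)
    history = model.fit(train_data, train_labels, epochs=50, verbose=0)
    print(f"lr={lr}: final training loss = {history.history['loss'][-1]}")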