Gradient Descent vs Stochastic GD

For the programming assignment of week 2 it is stated several times that SGD computes faster than GD. I am still trying to figure out why: SGD uses an extra nested for loop over the m training examples, whereas GD only loops over the number of iterations and uses vectorized computation. As stated in the notebook “Optimization_methods.ipynb”:

  • (Batch) Gradient Descent:
X = data_input
Y = labels
parameters = initialize_parameters(layers_dims)
for i in range(0, num_iterations):
    # Forward propagation
    a, caches = forward_propagation(X, parameters)
    # Compute cost.
    cost = compute_cost(a, Y)
    # Backward propagation.
    grads = backward_propagation(a, caches, parameters)
    # Update parameters.
    parameters = update_parameters(parameters, grads)

  • Stochastic Gradient Descent:
X = data_input
Y = labels
parameters = initialize_parameters(layers_dims)
for i in range(0, num_iterations):
    cost = 0                          # reset the running cost each epoch
    for j in range(0, m):
        # Forward propagation on a single example (kept 2-D as a column)
        a, caches = forward_propagation(X[:, j:j+1], parameters)
        # Compute cost
        cost += compute_cost(a, Y[:, j:j+1])
        # Backward propagation
        grads = backward_propagation(a, caches, parameters)
        # Update parameters.
        parameters = update_parameters(parameters, grads)

In Stochastic Gradient Descent, you use only 1 training example before updating the parameters. When the training set is large, SGD can be faster. But the parameters will “oscillate” toward the minimum rather than converge smoothly.
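The notebook’s helper functions (forward_propagation, compute_cost, etc.) only exist inside the assignment, so here is a minimal self-contained sketch of the same two loops. The toy data, the plain linear model, and the learning rates are my own assumptions, chosen just so the code runs end to end:

import numpy as np

# Hypothetical toy data: m examples, n features, linear targets plus noise
rng = np.random.default_rng(0)
n, m = 2, 500
X = rng.normal(size=(n, m))
Y = np.array([[1.5, -2.0]]) @ X + 0.1 * rng.normal(size=(1, m))

def cost(w):
    return float(np.mean((w.T @ X - Y) ** 2) / 2)

# (Batch) GD: one vectorized gradient over all m examples per update
w_gd = np.zeros((n, 1))
for i in range(50):                          # 50 epochs, 1 update each
    w_gd -= 0.1 * (X @ (w_gd.T @ X - Y).T / m)

# SGD: one example per update, m updates per epoch
w_sgd = np.zeros((n, 1))
for i in range(50):                          # 50 epochs, m updates each
    for j in range(m):
        xj, yj = X[:, j:j+1], Y[:, j:j+1]    # keep 2-D column shapes
        w_sgd -= 0.01 * xj @ (w_sgd.T @ xj - yj).T

print("final cost  GD:", cost(w_gd), " SGD:", cost(w_sgd))

Note that the SGD inner loop touches the same m examples per epoch as the single vectorized GD pass; what differs is the number of parameter updates per epoch, which is the point discussed below.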

Hi @Igodlab,

I think what happens is that in GD you only update the parameters once per epoch (one pass over the whole training set), so you take a long time before each step, while in SGD you update your parameters m times per epoch, so you make m steps per epoch.

Now, given that individual examples can vary quite a bit from one another, the direction of each step is not “stable”, and the steps can oscillate instead of following the smoother path taken by GD.

Hi, and this is exactly why it doesn’t make sense to me. In GD you update the parameters once per epoch, that is O(n_epochs) updates. But in SGD you update them m times per epoch, so that is O(m * n_epochs) updates, which is m times as many!

Hi @Igodlab,

I think in both cases the complexity is O(m * n_epochs), because although you are not making as many updates in the GD case as in the SGD case, you still need to iterate over all m examples in each epoch.

Think about it this way: with SGD you take a step every time you see an example, while with GD you have to wait until the epoch is finished to take your first step. So, at the end of the first epoch SGD has advanced m steps and GD just one.

Sure, the direction of the lonely GD step may be better, but it is only one step, while with SGD you take many steps, maybe with some misdirections along the way, hence the “oscillations”, but you are moving faster.
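To make this concrete, here is a small self-contained check (again a hypothetical least-squares toy problem in plain NumPy, with learning rates I picked for illustration) that compares where each method stands after exactly one epoch, i.e. after both have seen all m examples once:

import numpy as np

rng = np.random.default_rng(1)
n, m = 2, 1000
X = rng.normal(size=(n, m))
Y = np.array([[1.5, -2.0]]) @ X              # noiseless targets for simplicity

def cost(w):
    return float(np.mean((w.T @ X - Y) ** 2) / 2)

# One epoch of GD: all m examples seen, exactly 1 parameter update
w_gd = np.zeros((n, 1))
w_gd -= 0.1 * (X @ (w_gd.T @ X - Y).T / m)

# One epoch of SGD: all m examples seen, exactly m parameter updates
w_sgd = np.zeros((n, 1))
for j in range(m):
    xj, yj = X[:, j:j+1], Y[:, j:j+1]
    w_sgd -= 0.01 * xj @ (w_sgd.T @ xj - yj).T

print("cost after 1 epoch  GD:", cost(w_gd), " SGD:", cost(w_sgd))

On a run like this SGD typically ends the epoch far below GD, at the price of a noisier path. Both do O(m) work per epoch; the caveat is that GD’s single vectorized pass is usually cheaper per example than m tiny passes, which is why mini-batch gradient descent is the compromise the notebook turns to next.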

I see, I think I am starting to understand. SGD converges faster because it updates the parameters more frequently. Thanks for the responses @kampamocha
