Gradient Descent vs Stochastic GD

For the programming assignment of week 2 it is stated several times that SGD computes faster than GD. I am still trying to figure out why: SGD uses an extra nested for loop over the m training examples, whereas GD only loops over the number of iterations and uses vectorized computation. As stated in the notebook “Optimization_methods.ipynb”:

  • (Batch) Gradient Descent:
X = data_input
Y = labels
parameters = initialize_parameters(layers_dims)
for i in range(0, num_iterations):
    # Forward propagation
    a, caches = forward_propagation(X, parameters)
    # Compute cost.
    cost = compute_cost(a, Y)
    # Backward propagation.
    grads = backward_propagation(a, caches, parameters)
    # Update parameters.
    parameters = update_parameters(parameters, grads)

  • Stochastic Gradient Descent:
X = data_input
Y = labels
parameters = initialize_parameters(layers_dims)
for i in range(0, num_iterations):
    cost = 0                          # reset the running cost each epoch
    for j in range(0, m):
        # Forward propagation on a single example (kept 2-D as a column)
        a, caches = forward_propagation(X[:, j:j+1], parameters)
        # Compute cost
        cost += compute_cost(a, Y[:, j:j+1])
        # Backward propagation
        grads = backward_propagation(a, caches, parameters)
        # Update parameters.
        parameters = update_parameters(parameters, grads)

In Stochastic Gradient Descent, you use only 1 training example before updating the parameters. When the training set is large, SGD can be faster. But the parameters will “oscillate” toward the minimum rather than converge smoothly.
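The notebook’s helper functions (forward_propagation, compute_cost, etc.) only exist inside the assignment, so here is a minimal self-contained sketch of the same two loops. The toy data, the plain linear model, and the learning rates are my own assumptions, chosen just so the code runs end to end:

import numpy as np

# Hypothetical toy data: m examples, n features, linear targets plus noise
rng = np.random.default_rng(0)
n, m = 2, 500
X = rng.normal(size=(n, m))
Y = np.array([[1.5, -2.0]]) @ X + 0.1 * rng.normal(size=(1, m))

def cost(w):
    return float(np.mean((w.T @ X - Y) ** 2) / 2)

# (Batch) GD: one vectorized gradient over all m examples per update
w_gd = np.zeros((n, 1))
for i in range(50):                          # 50 epochs, 1 update each
    w_gd -= 0.1 * (X @ (w_gd.T @ X - Y).T / m)

# SGD: one example per update, m updates per epoch
w_sgd = np.zeros((n, 1))
for i in range(50):                          # 50 epochs, m updates each
    for j in range(m):
        xj, yj = X[:, j:j+1], Y[:, j:j+1]    # keep 2-D column shapes
        w_sgd -= 0.01 * xj @ (w_sgd.T @ xj - yj).T

print("final cost  GD:", cost(w_gd), " SGD:", cost(w_sgd))

Note that the SGD inner loop touches the same m examples per epoch as the single vectorized GD pass; what differs is the number of parameter updates per epoch, which is the point discussed below.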

Hi @Igodlab,

I think what happens is that in GD you only update the parameters once per epoch (one pass over the whole training set), so you take a long time before each step, while in SGD you update your parameters m times per epoch, so you make m steps per epoch.

Now, given that individual examples can vary quite a bit from one another, the direction of each step is not “stable”, and the steps can oscillate instead of following the smoother path taken by GD.

Hi, and this is exactly why it doesn’t make sense to me. In GD you update the parameters once per epoch, that is O(n_epochs) updates. But in SGD you update them m times per epoch, so that is O(m * n_epochs) updates, which is m times as many!

Hi @Igodlab,

I think in both cases the complexity is O(m * n_epochs), because although you are not making as many updates in the GD case as in the SGD case, you still need to iterate over all m examples in each epoch.

Think about it this way: with SGD you take a step every time you see an example, while with GD you have to wait until the epoch is finished to take your first step. So, at the end of the first epoch SGD has advanced m steps and GD just one.

Sure, the direction of the lonely GD step may be better, but it is only one step, while with SGD you take many steps, maybe with some misdirections along the way, hence the “oscillations”, but you are moving faster.
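To make this concrete, here is a small self-contained check (again a hypothetical least-squares toy problem in plain NumPy, with learning rates I picked for illustration) that compares where each method stands after exactly one epoch, i.e. after both have seen all m examples once:

import numpy as np

rng = np.random.default_rng(1)
n, m = 2, 1000
X = rng.normal(size=(n, m))
Y = np.array([[1.5, -2.0]]) @ X              # noiseless targets for simplicity

def cost(w):
    return float(np.mean((w.T @ X - Y) ** 2) / 2)

# One epoch of GD: all m examples seen, exactly 1 parameter update
w_gd = np.zeros((n, 1))
w_gd -= 0.1 * (X @ (w_gd.T @ X - Y).T / m)

# One epoch of SGD: all m examples seen, exactly m parameter updates
w_sgd = np.zeros((n, 1))
for j in range(m):
    xj, yj = X[:, j:j+1], Y[:, j:j+1]
    w_sgd -= 0.01 * xj @ (w_sgd.T @ xj - yj).T

print("cost after 1 epoch  GD:", cost(w_gd), " SGD:", cost(w_sgd))

On a run like this SGD typically ends the epoch far below GD, at the price of a noisier path. Both do O(m) work per epoch; the caveat is that GD’s single vectorized pass is usually cheaper per example than m tiny passes, which is why mini-batch gradient descent is the compromise the notebook turns to next.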

I see, I think I am starting to understand. SGD converges faster because it updates the parameters more frequently. Thanks for the responses @kampamocha
