Can someone tell me why we need two for-loops in SGD?
I thought the number of iterations is the same as the number of training examples, since only a single example is processed at a time.
Also, why do you need a for loop in batch gradient descent, when we process all the training examples at once?
A for loop is needed because you divide the full set of examples into batches of a given size; one loop then goes through all the batches in the dataset, and a second, outer loop repeats that whole pass many times.
The model does not see the whole set of samples just once, but num_iteration times. For example, if we set num_iteration to 10 and do SGD with a dataset of 20 samples, there will be 10 x 20 = 200 gradient descent updates.
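The two loops and the update count can be sketched like this (a minimal illustration using NumPy linear regression; the data, learning rate, and variable names other than num_iteration are made up for the example):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))          # 20 samples, 3 features
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w                         # targets from a known linear model

w = np.zeros(3)
lr = 0.1
num_iteration = 10                     # outer loop: passes over the data
updates = 0
for epoch in range(num_iteration):
    for i in rng.permutation(len(X)):  # inner loop: one sample at a time
        grad = (X[i] @ w - y[i]) * X[i]  # gradient of squared error on one sample
        w -= lr * grad                   # one gradient descent update
        updates += 1

print(updates)  # 10 epochs x 20 samples = 200 updates
```

Shuffling the sample order each epoch (the permutation) is standard practice for SGD, though not required for the counting argument.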
In all forms of Gradient Descent, the outer loop is over the number of “epochs” of training. One epoch is a complete pass through the entire training set and the application of the computed gradients to update the weights. Or think of it as “passes of training”. One pass is never enough and frequently it takes thousands or even tens of thousands of passes.
Once we introduce Minibatch Gradient Descent, then we divide the complete training set into smaller “minibatches” and we handle those one at a time in the inner loop. After each minibatch, we apply the gradients from that minibatch, so that we get more frequent updates to the parameters and (we hope) quicker convergence.
Then SGD is just the limiting case of Minibatch Gradient Descent where the batch size is 1.
So there is always the outer loop of training epochs and then you optionally have the inner loop over the minibatches.
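The epoch/minibatch structure above can be sketched as follows (a toy example; the dataset size, batch size of 25, learning rate, and epoch count are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))          # 100 samples, 2 features
y = X @ np.array([3.0, -1.0])          # noiseless linear targets

w = np.zeros(2)
lr = 0.05
batch_size = 25   # batch_size = 1 recovers SGD; batch_size = len(X) is full-batch GD
num_epochs = 50

for epoch in range(num_epochs):                    # outer loop: epochs
    order = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):     # inner loop: minibatches
        idx = order[start:start + batch_size]
        Xb, yb = X[idx], y[idx]
        grad = Xb.T @ (Xb @ w - yb) / len(idx)     # average gradient over the minibatch
        w -= lr * grad                             # update after every minibatch
```

Because the weights are updated after each minibatch rather than once per epoch, the parameters get len(X) / batch_size updates per pass instead of one.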
This was all explained in the lectures.