Gradient steps in mini-batch vs. batch gradient descent

In the first video of week 2, I did not understand why a single pass of batch gradient descent takes only 1 gradient step, but a single pass of mini-batch gradient descent takes 5000 gradient steps.

As I understand it, mini-batch is run 5000 times, and each time it reduces the loss function and updates the weights. Correct me where I am wrong.
But I did not understand how it becomes faster.

I think you described it correctly: the point of MiniBatch is that you update the weights more frequently, so you make more rapid progress and end up having to do fewer full “epochs” of training. An “epoch” is one pass through the entire training set. If you do “batch” gradient descent, then the weights (parameters) get updated only once per epoch. With minibatch, the parameters are changed after each and every minibatch. If you make a wise choice of the minibatch size, then that should enable you to get the same level of convergence in fewer epochs. At least that is the intent.
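To make the step-count arithmetic concrete, here is a minimal sketch with synthetic linear-regression data (the `run_epoch` helper and all numbers are illustrative, chosen to match the example: 320,000 examples split into 5000 mini-batches of 64). Batch gradient descent updates the parameters once per epoch; mini-batch updates them once per mini-batch.

```python
import numpy as np

# Illustrative sketch with synthetic data; `run_epoch` is a hypothetical
# helper, not anything from the course code.
rng = np.random.default_rng(0)
m = 5000 * 64                      # 320,000 examples -> 5000 mini-batches of 64
X = rng.normal(size=(m, 3))
y = X @ np.array([1.5, -2.0, 0.5])

def run_epoch(X, y, w, lr=0.01, batch_size=None):
    """One pass through the data; returns (w, number_of_gradient_steps)."""
    if batch_size is None:
        batch_size = len(X)        # "batch" GD: the whole set is one batch
    steps = 0
    for start in range(0, len(X), batch_size):
        Xb, yb = X[start:start + batch_size], y[start:start + batch_size]
        grad = Xb.T @ (Xb @ w - yb) / len(Xb)   # gradient of mean squared error
        w = w - lr * grad                       # one parameter update
        steps += 1
    return w, steps

w0 = np.zeros(3)
_, batch_steps = run_epoch(X, y, w0)                 # 1 update per epoch
_, mini_steps = run_epoch(X, y, w0, batch_size=64)   # 5000 updates per epoch
```

Both calls see exactly the same data once; the difference is only how often the parameters get updated along the way.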


One question comes to my mind: is it possible that in mini-batch our loss reaches its minimum before completing the 5000 gradient steps? Suppose gradient descent converges somewhere in between; if that is possible, then it is a waste to keep executing, and we should break out of the loop early. I don't know if that is possible or not.

I think of mini-batch as making an estimate of the gradient by taking a (mini-batch-sized) small sample.
We can then apply all of the normal advice from a statistics class about how to make that a good estimate (and this is where practices such as shuffling the data come from).
Even the gradients from the full training set can miss what we’re truly after, parameters that generalize well to the real-world population. Our training/validation/test data are again just samples from that population.
So our gradient descent is always going to be somewhat ragged in its approach to a minimum.
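That sampling view can be sketched numerically: the gradient computed on one shuffled mini-batch is a noisy estimate of the gradient over the full training set (synthetic data; all names and sizes here are illustrative):

```python
import numpy as np

# Illustrative: the mini-batch gradient as a sample estimate of the full gradient.
rng = np.random.default_rng(1)
m = 10_000
X = rng.normal(size=(m, 2))
y = X @ np.array([2.0, 3.0]) + rng.normal(size=m)   # noisy labels
w = np.array([0.5, -1.0])                            # some current parameters

def gradient(Xs, ys, w):
    """Gradient of mean squared error for the linear model Xs @ w."""
    return Xs.T @ (Xs @ w - ys) / len(Xs)

full_grad = gradient(X, y, w)                # "batch" gradient: the target

idx = rng.permutation(m)[:256]               # shuffle, take one mini-batch
mini_grad = gradient(X[idx], y[idx], w)      # noisy estimate of full_grad

# Relative error of the estimate -- small but not zero ("ragged" descent)
rel_err = np.linalg.norm(mini_grad - full_grad) / np.linalg.norm(full_grad)
```

The estimate gets better as the mini-batch grows, which is exactly the trade-off between update frequency and gradient noise.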

Some mechanisms for training (such as training on multiple systems, so that each training system is using rather different parameter values) have even more raggedness.

This whole area is a weird mixture of apparently precise complex math and lots of pragmatic "this works" experience. I've made myself not worry, and hope that in a few more decades we'll understand it better.


I don’t think that is a serious concern. In real-world cases, you are doing literally thousands or tens of thousands of epochs to get reasonable convergence, so if you end up spending a partial epoch that was actually unneeded, it will look like a rounding error.

Or to put it another way: if you think your cost is low enough partway through the first epoch, how do you know that’s not a premature decision? There may be samples that you haven’t seen yet that will require further updates to the weights, although (as Gordon mentioned) we try to shuffle the data before each subdivision precisely so that the statistical properties of the minibatches are relatively uniform. And if the minibatch size is small, a predictable side effect is greater noise (“jitter”) in the gradients that you are getting.

Note that there’s nothing to stop you from implementing an “exit criterion” inside your minibatch loop: if the cost is less than some specified amount, then “break” before the end of that epoch.
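That exit criterion might be sketched like this (a toy example with synthetic noiseless data; `cost_threshold` and all other names are invented, and checking the full-set cost after every mini-batch is done here only for illustration, since it would be expensive in practice):

```python
import numpy as np

# Toy sketch: stop mid-epoch once the cost falls below a threshold.
rng = np.random.default_rng(2)
m, batch_size = 2048, 64
X = rng.normal(size=(m, 2))
y = X @ np.array([1.0, -1.0])        # noiseless labels, so cost can get very low

w = np.zeros(2)
lr, cost_threshold = 0.1, 1e-4       # hypothetical hyperparameters
converged = False
for epoch in range(100):
    perm = rng.permutation(m)        # shuffle before each subdivision
    for start in range(0, m, batch_size):
        idx = perm[start:start + batch_size]
        err = X[idx] @ w - y[idx]
        w -= lr * X[idx].T @ err / batch_size       # one mini-batch update
        cost = np.mean((X @ w - y) ** 2)            # full-set cost (illustration only)
        if cost < cost_threshold:
            converged = True
            break                                   # exit before the epoch ends
    if converged:
        break
```

In practice you would check a cheap running estimate of the cost (or check once per epoch) rather than recomputing the full-set cost after every mini-batch.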

But the actual reality here is that we will soon switch to using higher level packages like TensorFlow, Keras or PyTorch which handle all these issues internally for us. Prof Ng is just explaining the concepts, so that we understand what is going on “under the covers”. Once we make that switch, we no longer have to build minibatch GD, but it helps to know how it works so that we understand the implications of choosing hyperparameters like the minibatch size.
