Why is Mini-batch Gradient Descent more efficient?

In my understanding, mini-batch gradient descent breaks the whole training set into many mini-batches and then iterates over them. I would like to know why it is more efficient to use a for loop over mini-batches rather than propagating all the data through in one vectorized pass. Thank you.

Hi. Mini-batch GD breaks the training set into a number of mini-batches.
It then processes a single mini-batch and updates its parameters W1, b1, …, Wn, bn to reduce the cost function.
Mini-batch GD repeats this for every mini-batch, i.e. it processes the mini-batch and then immediately updates its parameters.
In batch GD, by contrast, all the training examples are processed together (which takes a large amount of processing time) before the parameters are updated.
So in mini-batch GD, learning starts shortly after processing just a single mini-batch, rather than waiting for the whole training set to be processed as in batch GD.
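
Here is a minimal sketch of that difference, assuming a generic `compute_gradients` helper and simple parameters `W`, `b` (all hypothetical names, not taken from any assignment code):

```python
import numpy as np

def batch_gd_epoch(X, Y, W, b, lr, compute_gradients):
    # Batch GD: one parameter update per full pass over the training set.
    dW, db = compute_gradients(X, Y, W, b)  # gradients over all m examples
    W = W - lr * dW
    b = b - lr * db
    return W, b

def minibatch_gd_epoch(X, Y, W, b, lr, compute_gradients, batch_size=64):
    # Mini-batch GD: many parameter updates per pass over the training set.
    m = X.shape[1]                           # examples stored as columns
    perm = np.random.permutation(m)          # shuffle before slicing
    X, Y = X[:, perm], Y[:, perm]
    for start in range(0, m, batch_size):
        Xb = X[:, start:start + batch_size]
        Yb = Y[:, start:start + batch_size]
        dW, db = compute_gradients(Xb, Yb, W, b)  # gradients on one mini-batch only
        W = W - lr * dW                           # update right away, without
        b = b - lr * db                           # waiting for the rest of the data
    return W, b
```

The only real difference is where the update happens: inside the loop, so the parameters start moving toward the minimum after the first mini-batch instead of only after the whole set has been processed.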


Thank you @Mihir09, I now see that mini-batch GD starts updating its parameters sooner than batch GD.


It seems like mini-batches would allow us to end training early, assuming the first few batches are representative of the data set as a whole, because we would already have started making progress toward the minimum of the cost function. Is early termination the reason that people say mini-batch gradient descent is “faster” than batch gradient descent?

Yes, the point of mini-batch GD is that you may be able to achieve the same level of convergence with fewer total “epochs” of training. Recall that an “epoch” is one complete pass through the full training set. In full batch GD that means just one parameter update, but in mini-batch GD it means one pass through all the mini-batches, with an update after each one, so the parameters get many more adjustments per epoch.
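
As a toy illustration with made-up numbers (m = 5,000,000 examples and a mini-batch size of 1,000, both hypothetical), one epoch gives batch GD a single parameter update, while mini-batch GD performs 5,000 updates in the same pass:

```python
m, batch_size = 5_000_000, 1_000

updates_per_epoch_batch = 1                    # batch GD: one update per epoch
updates_per_epoch_minibatch = m // batch_size  # mini-batch GD: 5,000 updates per epoch

print(updates_per_epoch_batch, updates_per_epoch_minibatch)  # -> 1 5000
```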