Why is Batch Gradient Descent slower than Mini-batch Gradient Descent

From the lecture on Mini-batch GD, my understanding is that it's faster than batch GD because it processes much less training data per iteration. This has 2 apparent benefits:

  1. Mini-batch GD is able to update the parameters much more frequently than batch GD.

However, I’m not too convinced that this is the case, since for batch gradient descent the parameters are implicitly being updated for each training sample as well. For example, in batch GD, b is updated per iteration using db = 1/m * np.sum(dZ, axis=1, keepdims=True), which is nothing more than the average of the gradients from each sample in the training set.
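
For example, here is a quick sketch (the shapes and values for dZ are made up; this is not the lecture's code) showing that the vectorized expression and an explicit per-sample accumulation give the same db:

```python
import numpy as np

# Quick check: the vectorized db versus an explicit per-sample accumulation.
rng = np.random.default_rng(0)
n_units, m = 3, 5
dZ = rng.standard_normal((n_units, m))

db_vectorized = 1 / m * np.sum(dZ, axis=1, keepdims=True)

db_loop = np.zeros((n_units, 1))
for i in range(m):
    db_loop += dZ[:, i:i + 1]   # gradient contribution of sample i
db_loop /= m

print(np.allclose(db_vectorized, db_loop))  # True: db is the average of the per-sample gradients
```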

In fact, my understanding is that the difference between batch GD and mini-batch/stochastic GD is not necessarily how often the parameter updates are computed (a gradient contribution is computed for every sample in the training set either way), but rather how often the parameters actually change per update.

  • For batch GD, the parameters do not change with each per-sample update; they only change after the updates from all the training samples have been added together
  • For stochastic/mini-batch GD, in contrast, the parameters change immediately with each update from each sample/mini-batch.

Therefore, perhaps what makes mini-batch GD converge faster is that updating the parameters while they are frequently changing somehow speeds up training, perhaps by avoiding local minima (as outlined in this explanation), and not necessarily the fact that it performs more updates than batch GD.
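
To make that contrast concrete, here is a toy sketch of the two update schedules (a made-up linear model with random data, not the lecture's network; the learning rate and batch size are arbitrary):

```python
import numpy as np

# Toy linear model with made-up data.
rng = np.random.default_rng(1)
m, n = 1024, 4
X = rng.standard_normal((n, m))
Y = rng.standard_normal((1, m))
lr, batch_size = 0.01, 64

# Batch GD: every sample contributes to the gradient, but W changes only ONCE per pass.
W = np.zeros((1, n))
grad = (W @ X - Y) @ X.T / m          # average gradient over all m samples
W -= lr * grad                        # a single parameter change for the whole epoch

# Mini-batch GD: W changes after every mini-batch, so later gradients are computed
# with parameters that have already moved.
W = np.zeros((1, n))
for start in range(0, m, batch_size):
    Xb, Yb = X[:, start:start + batch_size], Y[:, start:start + batch_size]
    grad = (W @ Xb - Yb) @ Xb.T / Xb.shape[1]
    W -= lr * grad                    # m / batch_size = 16 parameter changes per epoch
```

Both passes see every sample once; the difference is that in the second loop, each gradient after the first is computed with parameters that have already moved.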

  2. Mini-batch GD uses much less RAM per iteration than batch GD (and is somehow faster because of this)

However,

  • If the training set fits comfortably into RAM for both batch GD & mini-batch GD, shouldn’t batch GD be faster than mini-batch GD due to vectorization?
  • If the training set for batch GD doesn’t fit into RAM, doesn’t that imply that training will not happen at all (rather than just being slower) for batch GD? Or does the computer somehow know to break the large training data into smaller chunks that fit into RAM (in which case it’s doing quasi mini-batch GD, except that it only updates the parameters after several mini-batches)? I’m not sure how a computer handles RAM for large training data, so I’m really interested in how a real computer would behave in such a case.
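
For concreteness, here is a toy sketch of the “quasi mini-batch” scheme I am imagining (the chunk “loading” from disk is only faked with slicing; the linear model, chunk size and learning rate are all made up; the point is just when the parameters change):

```python
import numpy as np

# The gradient of the FULL training set is accumulated chunk by chunk, so only one chunk
# needs to be in RAM at a time, but the parameters change only once, after all chunks have
# been processed. Mathematically this is still batch GD. (In practice the chunks would be
# read from disk, e.g. with np.load or np.memmap.)
rng = np.random.default_rng(0)
m, n_x, chunk = 1000, 4, 100
X = rng.standard_normal((n_x, m))
Y = rng.standard_normal((1, m))
W, b, lr = np.zeros((1, n_x)), 0.0, 0.1

dW_acc, db_acc = np.zeros_like(W), 0.0
for start in range(0, m, chunk):
    Xc, Yc = X[:, start:start + chunk], Y[:, start:start + chunk]
    dZ = W @ Xc + b - Yc               # error of a linear model on this chunk
    dW_acc += dZ @ Xc.T                # accumulate the gradient, but do NOT touch W yet
    db_acc += dZ.sum()
W -= lr * dW_acc / m                   # one single update after all chunks, as in batch GD
b -= lr * db_acc / m
```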

Could you please help confirm whether my understanding of the above advantages of mini-batch GD compared to batch GD is correct?
Thank you in advance for your clarifications!

Hi @chemgeostats ,

We have Batch GD, which uses all the training data at the same time for each cycle of GD. We can use vectorization, which is a very efficient way to perform calculations because it does them in parallel. However, with huge training sets, using all the training data at once may not be possible due to limitations in the infrastructure.

Next we have Stochastic GD, which uses one sample at a time for each cycle of GD. This is an option when you have a huge training set, but in this case we cannot take advantage of vectorization, since each step processes only a single sample.

Finally, we have mini-batch GD, which is a middle ground between the two. We split the huge training set into mini-batches, and within each mini-batch we can use vectorization.
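
For example, a minimal sketch of that splitting step could look like this (just an illustration with made-up data; the helper name make_mini_batches is my own, not from the course):

```python
import numpy as np

# Shuffle the m columns of X/Y and slice them into mini-batches. Each slice is still a
# matrix, so the forward/backward pass inside the training loop stays fully vectorized.
def make_mini_batches(X, Y, batch_size, seed=0):
    m = X.shape[1]
    perm = np.random.default_rng(seed).permutation(m)
    X_shuf, Y_shuf = X[:, perm], Y[:, perm]
    return [(X_shuf[:, k:k + batch_size], Y_shuf[:, k:k + batch_size])
            for k in range(0, m, batch_size)]

X = np.random.randn(4, 1000)
Y = np.random.randn(1, 1000)
batches = make_mini_batches(X, Y, batch_size=64)
print(len(batches), batches[0][0].shape)   # 16 mini-batches; each is (4, 64), the last (4, 40)
```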

The thing with vectorization is that it does the calculations in parallel, and this parallelization is what makes the difference in speed.
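
As a rough illustration (a sketch with random data and made-up shapes; exact timings will vary by machine), compare accumulating a gradient one sample at a time in a Python loop with the single vectorized matrix product:

```python
import time
import numpy as np

# Compute dW = dZ @ X.T / m two ways: one sample at a time in a Python loop (no
# vectorization, as in stochastic GD) versus a single vectorized matrix product.
rng = np.random.default_rng(0)
n_x, n_units, m = 100, 50, 20000
X = rng.standard_normal((n_x, m))
dZ = rng.standard_normal((n_units, m))

t0 = time.perf_counter()
dW_loop = np.zeros((n_units, n_x))
for i in range(m):                          # one sample per step
    dW_loop += np.outer(dZ[:, i], X[:, i])
dW_loop /= m
t1 = time.perf_counter()

dW_vec = dZ @ X.T / m                       # the whole batch in one call
t2 = time.perf_counter()

print(np.allclose(dW_loop, dW_vec))         # True: same result
print(f"loop: {t1 - t0:.3f}s   vectorized: {t2 - t1:.3f}s")
```

On most machines the vectorized line is dramatically faster, even though both produce exactly the same dW.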

Juan