Why is Batch Gradient Descent slower than Mini-batch Gradient Descent

From the lecture on Mini-batch GD, my understanding is that it's faster than batch GD because it processes much less training data per iteration. This has 2 apparent benefits:

  1. Mini-batch GD is able to update the parameters much more frequently than batch GD.

However, I’m not too convinced that this is the case, since for batch gradient descent the parameters are implicitly being updated for each training sample as well. For example, in batch GD, b is updated per iteration using db = 1/m * np.sum(dZ, axis=1, keepdims=True), which is nothing more than the average of the gradients from each sample in the training set.
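
For example, here is a quick sketch (the shapes and values for dZ are made up; this is not the lecture's code) showing that the vectorized expression and an explicit per-sample accumulation give the same db:

```python
import numpy as np

# Quick check: the vectorized db versus an explicit per-sample accumulation.
rng = np.random.default_rng(0)
n_units, m = 3, 5
dZ = rng.standard_normal((n_units, m))

db_vectorized = 1 / m * np.sum(dZ, axis=1, keepdims=True)

db_loop = np.zeros((n_units, 1))
for i in range(m):
    db_loop += dZ[:, i:i + 1]   # gradient contribution of sample i
db_loop /= m

print(np.allclose(db_vectorized, db_loop))  # True: db is the average of the per-sample gradients
```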

In fact, my understanding is that the difference between batch GD and mini-batch/stochastic GD is not necessarily how often the parameter updates are computed (a gradient contribution is computed for every sample in the training set either way), but rather how often the parameters actually change per update.

  • For batch GD, the parameters do not change with each per-sample update; they only change after the updates from all the training samples have been added together
  • For stochastic/mini-batch GD, in contrast, the parameters change immediately with each update from each sample/mini-batch.

Therefore, perhaps what makes mini-batch GD converge faster is that updating the parameters while they are frequently changing somehow speeds up training, perhaps by avoiding local minima (as outlined in this explanation), and not necessarily the fact that it performs more updates than batch GD.
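
To make that contrast concrete, here is a toy sketch of the two update schedules (a made-up linear model with random data, not the lecture's network; the learning rate and batch size are arbitrary):

```python
import numpy as np

# Toy linear model with made-up data.
rng = np.random.default_rng(1)
m, n = 1024, 4
X = rng.standard_normal((n, m))
Y = rng.standard_normal((1, m))
lr, batch_size = 0.01, 64

# Batch GD: every sample contributes to the gradient, but W changes only ONCE per pass.
W = np.zeros((1, n))
grad = (W @ X - Y) @ X.T / m          # average gradient over all m samples
W -= lr * grad                        # a single parameter change for the whole epoch

# Mini-batch GD: W changes after every mini-batch, so later gradients are computed
# with parameters that have already moved.
W = np.zeros((1, n))
for start in range(0, m, batch_size):
    Xb, Yb = X[:, start:start + batch_size], Y[:, start:start + batch_size]
    grad = (W @ Xb - Yb) @ Xb.T / Xb.shape[1]
    W -= lr * grad                    # m / batch_size = 16 parameter changes per epoch
```

Both passes see every sample once; the difference is that in the second loop, each gradient after the first is computed with parameters that have already moved.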

  2. Mini-batch GD uses much less RAM per iteration than batch GD (and is somehow faster because of this)

However,

  • If the training set fits comfortably into RAM for both batch GD & mini-batch GD, shouldn’t batch GD be faster than mini-batch GD due to vectorization?
  • If the training set for batch GD doesn’t fit into RAM, doesn’t that imply that training will not happen at all (rather than just being slower) for batch GD? Or does the computer somehow know to break the large training data into smaller chunks that fit into RAM (in which case it’s doing quasi mini-batch GD, except that it only updates the parameters after several mini-batches)? I’m not sure how a computer handles RAM for large training data, so I’m really interested in how a real computer would behave in such a case.
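
For concreteness, here is a toy sketch of the “quasi mini-batch” scheme I am imagining (the chunk “loading” from disk is only faked with slicing; the linear model, chunk size and learning rate are all made up; the point is just when the parameters change):

```python
import numpy as np

# The gradient of the FULL training set is accumulated chunk by chunk, so only one chunk
# needs to be in RAM at a time, but the parameters change only once, after all chunks have
# been processed. Mathematically this is still batch GD. (In practice the chunks would be
# read from disk, e.g. with np.load or np.memmap.)
rng = np.random.default_rng(0)
m, n_x, chunk = 1000, 4, 100
X = rng.standard_normal((n_x, m))
Y = rng.standard_normal((1, m))
W, b, lr = np.zeros((1, n_x)), 0.0, 0.1

dW_acc, db_acc = np.zeros_like(W), 0.0
for start in range(0, m, chunk):
    Xc, Yc = X[:, start:start + chunk], Y[:, start:start + chunk]
    dZ = W @ Xc + b - Yc               # error of a linear model on this chunk
    dW_acc += dZ @ Xc.T                # accumulate the gradient, but do NOT touch W yet
    db_acc += dZ.sum()
W -= lr * dW_acc / m                   # one single update after all chunks, as in batch GD
b -= lr * db_acc / m
```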

Could you please help confirm whether my understanding of the above advantages of mini-batch GD compared to batch GD is correct?
Thank you in advance for your clarifications!

Hi @chemgeostats ,

We have Batch GD, which uses all the training data at the same time for each cycle of GD. We can use vectorization, which is a very efficient way to perform calculations because it does them in parallel. However, with huge training sets, using all the training data at once may not be possible due to limitations in the infrastructure.

Next we have Stochastic GD, which uses one sample at a time for each cycle of GD. This is an option when you have a huge training set, but in this case we cannot take advantage of vectorization, since each step processes only a single sample.

Finally, we have mini-batch GD, which is a middle ground between the two. We split the huge training set into mini-batches, and within each mini-batch we can use vectorization.
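
For example, a minimal sketch of that splitting step could look like this (just an illustration with made-up data; the helper name make_mini_batches is my own, not from the course):

```python
import numpy as np

# Shuffle the m columns of X/Y and slice them into mini-batches. Each slice is still a
# matrix, so the forward/backward pass inside the training loop stays fully vectorized.
def make_mini_batches(X, Y, batch_size, seed=0):
    m = X.shape[1]
    perm = np.random.default_rng(seed).permutation(m)
    X_shuf, Y_shuf = X[:, perm], Y[:, perm]
    return [(X_shuf[:, k:k + batch_size], Y_shuf[:, k:k + batch_size])
            for k in range(0, m, batch_size)]

X = np.random.randn(4, 1000)
Y = np.random.randn(1, 1000)
batches = make_mini_batches(X, Y, batch_size=64)
print(len(batches), batches[0][0].shape)   # 16 mini-batches; each is (4, 64), the last (4, 40)
```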

The thing with vectorization is that it does the calculations in parallel, and this parallelization is what makes the difference in speed.
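
As a rough illustration (a sketch with random data and made-up shapes; exact timings will vary by machine), compare accumulating a gradient one sample at a time in a Python loop with the single vectorized matrix product:

```python
import time
import numpy as np

# Compute dW = dZ @ X.T / m two ways: one sample at a time in a Python loop (no
# vectorization, as in stochastic GD) versus a single vectorized matrix product.
rng = np.random.default_rng(0)
n_x, n_units, m = 100, 50, 20000
X = rng.standard_normal((n_x, m))
dZ = rng.standard_normal((n_units, m))

t0 = time.perf_counter()
dW_loop = np.zeros((n_units, n_x))
for i in range(m):                          # one sample per step
    dW_loop += np.outer(dZ[:, i], X[:, i])
dW_loop /= m
t1 = time.perf_counter()

dW_vec = dZ @ X.T / m                       # the whole batch in one call
t2 = time.perf_counter()

print(np.allclose(dW_loop, dW_vec))         # True: same result
print(f"loop: {t1 - t0:.3f}s   vectorized: {t2 - t1:.3f}s")
```

On most machines the vectorized line is dramatically faster, even though both produce exactly the same dW.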

Juan