C3W3. Why is micro-batch processing faster than mini-batch processing?

ivan_100096 · August 26, 2021, 6:41am

Hey,
I have a question.
Here is a picture that explains why GPipe is faster (screenshot via Lightshot).
And in this article I found the following:

By pipelining the execution across micro-batches, accelerators can operate in parallel. In addition, gradients are consistently accumulated across micro-batches, so that the number of partitions does not affect the model quality.

But why can’t accelerators operate in parallel on mini-batches?
It looks like with mini-batches the accelerators have to wait for each other, but with micro-batches they don’t. Why?

Thanks in advance for the explanation.

balaji.ambresh · October 13, 2022, 8:30pm

A micro batch is, in principle, just a smaller mini batch: GPipe splits each mini batch into several micro batches. The reason for using micro batches is higher GPU utilization.
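To put a rough number on the utilization claim, here is a minimal sketch of an idealized pipeline. It assumes equal-cost stages, no communication overhead, and one time slot per micro batch per stage; the function name `pipeline_utilization` is made up for this illustration. Under those assumptions, the last micro batch leaves the last of K stages after M + K - 1 slots, and each stage is busy for M of them.

```python
def pipeline_utilization(num_stages: int, num_micro_batches: int) -> float:
    """Fraction of time each accelerator is busy during one forward sweep.

    Idealized model (an assumption, not a measurement): equal-cost stages,
    no communication cost, one time slot per micro batch per stage.
    """
    busy_slots = num_micro_batches                     # slots each stage spends computing
    total_slots = num_micro_batches + num_stages - 1   # slots until the last micro batch exits
    return busy_slots / total_slots


if __name__ == "__main__":
    stages = 4
    for m in (1, 4, 32):
        u = pipeline_utilization(stages, m)
        print(f"stages={stages} micro_batches={m} utilization={u:.2f}")
    # prints utilization 0.25, 0.57 and 0.91 respectively
```

With a single chunk (the whole mini batch), only one of the four accelerators works at any moment. Splitting the same data into more micro batches shrinks the idle "bubble" at the start and end of the pipeline.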

Here’s what happens in a single step of mini batch gradient descent:

1. Perform the forward pass.
2. Calculate the loss.
3. Update the model parameters using the optimizer and the gradient of the loss.
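The three steps above can be sketched for a toy model. This is only an illustration: the one-parameter model `y_hat = w * x`, the mean squared error loss, the plain SGD update, and the name `mini_batch_step` are all assumptions chosen to keep the example small.

```python
def mini_batch_step(w, xs, ys, lr=0.05):
    """One gradient descent step on a single mini batch (toy model y_hat = w * x)."""
    n = len(xs)
    preds = [w * x for x in xs]                                  # 1. forward pass
    loss = sum((p - y) ** 2 for p, y in zip(preds, ys)) / n      # 2. mean squared error
    # Gradient of the loss with respect to w.
    grad = sum(2 * (p - y) * x for p, y, x in zip(preds, ys, xs)) / n
    w = w - lr * grad                                            # 3. parameter update
    return w, loss


if __name__ == "__main__":
    w = 0.0
    mini_batches = [([1.0, 2.0], [2.0, 4.0]), ([3.0, 4.0], [6.0, 8.0])]
    for xs, ys in mini_batches:
        # The parameter changes after every mini batch.
        w, loss = mini_batch_step(w, xs, ys)
        print(f"w={w:.3f} loss={loss:.3f}")
```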

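As you can see from the 3rd step, model parameters are updated after every mini batch. Micro batches take this one step further: the mini batch is split into micro batches, and the parameters are not updated after each one. Instead, the gradients are accumulated across the micro batches, and the parameters are updated once, after all micro batches of the mini batch have been processed. Since the same parameters are used for every micro batch in the chunk, an accelerator can start on the next micro batch as soon as it has handed the current one to the next accelerator, instead of sitting idle until the whole mini batch has passed through the other partitions. That overlap is why training is quicker.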
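A small sketch of the accumulation idea, again on an assumed toy model (`y_hat = w * x` with squared error; all function names here are hypothetical). It shows that summing gradients over micro batches and updating once gives the same result as one update on the whole mini batch, which is why splitting does not change what the model learns in this simple setting.

```python
def grad_sum(w, xs, ys):
    """Sum (not mean) of per-example gradients of (w * x - y) ** 2 w.r.t. w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys))


def step_full(w, xs, ys, lr):
    """One update computed on the whole mini batch at once."""
    return w - lr * grad_sum(w, xs, ys) / len(xs)


def step_accumulated(w, xs, ys, lr, micro_size):
    """Same update, but gradients are accumulated micro batch by micro batch."""
    acc = 0.0
    for i in range(0, len(xs), micro_size):
        # w is NOT updated inside this loop, so every micro batch sees the same parameters.
        acc += grad_sum(w, xs[i:i + micro_size], ys[i:i + micro_size])
    return w - lr * acc / len(xs)  # single update per mini batch


if __name__ == "__main__":
    xs, ys = [1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0]
    full = step_full(0.5, xs, ys, lr=0.05)
    accumulated = step_accumulated(0.5, xs, ys, lr=0.05, micro_size=2)
    print(full, accumulated)
    assert abs(full - accumulated) < 1e-9
```

The sketch only covers the arithmetic. The speed-up itself comes from running the micro batches through different accelerators at the same time, which this single-process example does not model.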
