Hey,
I have a question.
Here is a picture that explains what GPipe is after (Screenshot by Lightshot).
And in this article I found the following information:
By pipelining the execution across micro-batches, accelerators can operate in parallel. In addition, gradients are consistently accumulated across micro-batches, so that the number of partitions does not affect the model quality.
But why can't the accelerators operate in parallel for mini-batches?
It looks like with mini-batches the accelerators have to wait for each other, but with micro-batches they don't. Why?
Thanks in advance for the explanation.
In theory, a micro-batch is just a small mini-batch. The reason for using micro-batches is higher GPU utilization.
Here’s what happens for a single mini-batch gradient descent step (there is a code sketch after the list):
- Perform forward pass.
- Calculate loss.
- Compute the gradient of the loss (backward pass) and update the model parameters using the optimizer.
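To make these steps concrete, here is a minimal sketch in plain PyTorch. The toy model, optimizer, and random data are hypothetical placeholders, just to show the order of operations:

```python
import torch
import torch.nn as nn

# Hypothetical toy model and data, only to illustrate one
# ordinary mini-batch gradient descent step.
model = nn.Linear(10, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(32, 10)          # one mini-batch of 32 inputs
y = torch.randint(0, 2, (32,))   # matching targets

optimizer.zero_grad()            # clear gradients from the previous step
out = model(x)                   # 1. forward pass
loss = loss_fn(out, y)           # 2. calculate the loss
loss.backward()                  # 3. backward pass: gradient of the loss
optimizer.step()                 #    update the parameters with the optimizer
```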
As you can see from the 3rd step, the model parameters are updated after every mini-batch. Micro-batching takes this one step further. The mini-batch is split into micro-batches, and instead of updating the model parameters after every micro-batch, the gradients are accumulated across the micro-batches. Once all micro-batches of the mini-batch have been processed, a single optimizer step is applied with the accumulated gradient. Since the model parameters stay the same for the whole chunk of micro-batches, the micro-batches can be pipelined through the accelerators one after another instead of each accelerator waiting for a full mini-batch update, which keeps the GPUs busy and makes training quicker.
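For comparison, here is the same update written with micro-batches and gradient accumulation. This is only a sketch of the accumulation idea on a single device; GPipe additionally splits the model across accelerators and pipelines the micro-batches through the stages, which this toy example does not show:

```python
import torch
import torch.nn as nn

# Same hypothetical toy model and data as above.
model = nn.Linear(10, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(32, 10)          # one mini-batch of 32 inputs
y = torch.randint(0, 2, (32,))   # matching targets
num_micro = 4                    # split the mini-batch into 4 micro-batches

optimizer.zero_grad()
for xm, ym in zip(x.chunk(num_micro), y.chunk(num_micro)):
    out = model(xm)                       # forward pass on one micro-batch
    loss = loss_fn(out, ym) / num_micro   # scale so the accumulated gradient
                                          # matches the full mini-batch gradient
    loss.backward()                       # gradients accumulate in the .grad buffers
optimizer.step()                          # one parameter update for the whole mini-batch
```

Because the parameters do not change until the final `optimizer.step()`, the micro-batches are independent of each other, which is exactly what lets GPipe overlap them across the pipeline stages.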