Here is a diagram of pipeline-parallel training, which splits the mini-batch into micro-batches and runs a model partitioned across 4 devices.
The forward propagation is straightforward: the pipeline schedules the computation across the 4 devices to keep them busy and efficient. But the backward pass is confusing. Once all of the micro-batches have reached device 4, the gradient formula lets us compute the cost for the whole mini-batch and perform the backward propagation directly, without any extra overhead. So why do we still keep the cost of each micro-batch separate and pipeline the backward pass as well?
Please familiarize yourself with gradient accumulation.
Gradient accumulation is employed when one wants to train a large network with limited memory. To give a concrete example: if your hardware can only support a batch size of 8 but you want to train the NN with a batch size of 32, you'll run forward and backward on 4 micro-batches of 8, accumulate their gradients, and only then perform the weight update. In fact, this is exactly what the top figure is doing.
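Here's a minimal sketch of that idea in code (assuming PyTorch; the model, data, and hyper-parameters below are placeholders, not anything from the course material):

```python
import torch
from torch import nn

model = nn.Linear(10, 1)                       # stand-in for a large network
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()

# 4 micro-batches of 8 samples -> effective batch size of 32
micro_batches = [(torch.randn(8, 10), torch.randn(8, 1)) for _ in range(4)]
accum_steps = 4

optimizer.zero_grad()
for step, (x, y) in enumerate(micro_batches):
    loss = loss_fn(model(x), y) / accum_steps  # scale so the sum matches the full-batch mean
    loss.backward()                            # gradients are *added* into the .grad buffers
    if (step + 1) % accum_steps == 0:
        optimizer.step()                       # single weight update after all micro-batches
        optimizer.zero_grad()
```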
Pay attention to the right-most column, Update. It means that each device performs its update only after all gradients have been accumulated, which is why the Update action is stacked at the end of the timeline. It also helps to pay attention to the notation in the provided image: F_{i,j} refers to the forward pass of the j^{th} micro-batch through the i^{th} layer of the network, i.e. each Device x is responsible for training one layer of the network, and B_{i,j} is the corresponding backward pass.
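For intuition, here is a tiny script that prints one possible timeline in this notation, under the simplifying assumptions that every F/B slot takes one time unit and the backward pass visits micro-batches in reverse order (my own illustration, not GPipe's actual scheduler):

```python
# Each printed column is one device; each row is one time slot.
K, M = 4, 4                                                   # devices and micro-batches

fwd_start = {(i, j): i + j for i in range(K) for j in range(M)}          # F_{i,j}
t0 = K + M - 1                                                           # all forwards finished
bwd_start = {(i, j): t0 + (K - 1 - i) + (M - 1 - j)
             for i in range(K) for j in range(M)}                        # B_{i,j}

for t in range(2 * (K + M - 1)):
    row = []
    for i in range(K):
        f = [f"F{i}{j}" for j in range(M) if fwd_start[(i, j)] == t]
        b = [f"B{i}{j}" for j in range(M) if bwd_start[(i, j)] == t]
        row.append((f + b + ["---"])[0])                      # "---" marks an idle (bubble) slot
    print(f"t={t:2d} | " + " | ".join(row))
# Only after the last row has every B_{i,j} done does each device apply its accumulated
# update, which is why the Update column sits at the very end of the timeline.
```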
I hope this provides clarity on how the image at the bottom makes more efficient use of the hardware.
It’s easy to understand how pipeline parallelism optimizes the computation in the forward propagation you mentioned, since it involves matrix multiplication. When all the micro-batches reach the output layer (here that should happen at device 3), they are reduced to a single number, the cost, whose derivative is used for backprop and does not consume accelerator memory.
Thus I don’t see any reason to still keep the micro-batches separate and accumulate the derivative of each backward pass at each device, and there seems to be no need to stack the updates at the last timestamp either, since at each backprop step the weights of the corresponding layer could already be updated.
Backpropagation doesn’t involve just the cost scalar. Computing the gradients of the weights requires the activations from the previous layer. Please see how backpropagation works in course 1 week 4 assignment 1 of the deep learning specialization to jog your memory.
Since the whole setup is memory-constrained, the activations need to be recomputed.
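To make it concrete, here is a tiny NumPy sketch (my own illustration, not the assignment code) of the backward step for one linear layer: the weight gradient needs the cached activation A_prev from the forward pass, so the cost scalar alone is not enough.

```python
import numpy as np

rng = np.random.default_rng(0)
A_prev = rng.standard_normal((4, 8))     # cached activations from the forward pass (4 units, batch of 8)
W = rng.standard_normal((3, 4))
b = np.zeros((3, 1))

Z = W @ A_prev + b                       # forward pass of this layer
dZ = rng.standard_normal(Z.shape)        # gradient flowing back from the next layer

m = A_prev.shape[1]
dW = (dZ @ A_prev.T) / m                 # needs the cached A_prev, not just the cost scalar
db = dZ.sum(axis=1, keepdims=True) / m
dA_prev = W.T @ dZ                       # gradient passed to the previous layer/device
```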
Hey @balaji.ambresh, thanks for the reminder about the assignment notebook. I do remember that it’s a combination of the derivative from the later layer and the cached value (here the activation), and that the cached value has to be stored while performing the forward prop. Does that mean that, since not all the activations for a single mini-batch fit into the accelerator for the backward calculation, we keep the activations of the different micro-batches separate and store them in local memory or on disk at each device respectively? And while performing backprop, the derivatives received from the later layer are also kept separate per micro-batch, so each calculation only fetches the activations related to the received derivative into the accelerator, and the result for each micro-batch is accumulated? Once all the derivatives for the whole mini-batch have been received and processed, dW and db can be read from the accumulated result?
You got it. Once each layer’s gradients are accumulated, the updates happen in parallel.
Since GPipe is not open source, I don’t know about the caching details.
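To make the idea concrete anyway, here is a rough sketch of what a single pipeline stage holding one linear layer could do, under the assumption that it caches activations per micro-batch and accumulates gradients until every micro-batch has been processed (my own illustration, not GPipe’s actual implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 4))
dW_accum = np.zeros_like(W)
cache = {}                                             # per-micro-batch activation cache

def forward(j, A_prev):
    cache[j] = A_prev                                  # keep activations for micro-batch j
    return W @ A_prev                                  # send result on to the next stage

def backward(j, dZ):
    global dW_accum
    A_prev = cache.pop(j)                              # fetch the matching cached activation
    dW_accum += dZ @ A_prev.T / A_prev.shape[1]        # accumulate this micro-batch's gradient
    return W.T @ dZ                                    # pass the gradient to the previous stage

micro_batches = [rng.standard_normal((4, 8)) for _ in range(4)]
outs = [forward(j, A) for j, A in enumerate(micro_batches)]
for j, Z in reversed(list(enumerate(outs))):
    backward(j, rng.standard_normal(Z.shape))          # fake upstream gradients for the demo

W -= 0.01 * (dW_accum / len(micro_batches))            # single update after all micro-batches
```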