I’m confused about how weights are updated under the multi-GPU mirrored strategy (e.g., the code for C2_W4_Lab2). I’d appreciate it if someone could explain this:
I understand that on each training step the gradients are calculated for the current set of weights, and changes are made to model.trainable_variables. But if training is performed in parallel on multiple GPUs, then how is this training synchronized?
If one GPU makes changes to model.trainable_variables, then how can the other GPUs calculate gradients at the same time without interfering with the training?
As far as I remember, at the end of each training step the gradients from each GPU are aggregated.
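Exactly. During training, the mirrored strategy uses an all-reduce algorithm to aggregate the gradients from each device and make the combined result available on every device. Each device then applies the same update to its own copy of the variables, which keeps the copies in sync. No device changes its weights until the aggregation is done, so the gradient computations do not interfere with each other.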
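To make the synchronization concrete, here is a toy simulation in plain Python. It is not TensorFlow code and not how the library is implemented internally; it is only a sketch of the idea, assuming a one-parameter model y = w * x with a summed squared-error loss. The names local_gradient, all_reduce_sum, and train_step are made up for this illustration.

```python
# Toy simulation of synchronous data-parallel training (illustrative only,
# not TensorFlow). Model: y = w * x, loss = sum of (w*x - y)^2.

def local_gradient(w, shard):
    """Gradient of the summed squared error over one replica's shard."""
    return sum(2.0 * (w * x - y) * x for x, y in shard)

def all_reduce_sum(values):
    """Every replica receives the same total of all replicas' values."""
    total = sum(values)
    return [total for _ in values]

def train_step(replica_weights, shards, lr):
    # 1. Each replica computes gradients from its own (identical) copy of w.
    grads = [local_gradient(w, s) for w, s in zip(replica_weights, shards)]
    # 2. Gradients are summed across replicas; no replica updates w before this.
    reduced = all_reduce_sum(grads)
    # 3. Every replica applies the same update, so the copies stay identical.
    return [w - lr * g for w, g in zip(replica_weights, reduced)]

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]  # generated by y = 2x
shards = [data[:2], data[2:]]   # one shard per simulated GPU
weights = [0.0, 0.0]            # mirrored copies start out identical

for _ in range(50):
    weights = train_step(weights, shards, lr=0.01)

print(weights)  # both copies hold the same value, close to 2.0
```

The key point is the ordering inside train_step: all gradients are computed against the same weights, then combined, and only then applied. That is why one GPU never sees weights that another GPU has already changed.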
@Dror, there’s a good general overview of how parallelism works for the various strategies in the lecture, Types of distribution strategies
There’s also some more specific information in the TensorFlow documentation for the mirrored strategy.
Thank you for your reply.
I think I’ve got it now.
What I wasn’t sure about is how the gradients from the different GPUs are merged to update the weights coherently. However, if the total gradient is the sum of the gradients obtained for all training examples, then this makes sense.
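That intuition can be checked numerically. The snippet below is a small sketch in plain Python (again not TensorFlow), assuming a one-parameter model y = w * x with a summed squared-error loss; the function name grad and the example data are made up for illustration. It compares the gradient over a full batch with the sum of the gradients computed separately on two halves of that batch.

```python
# Check that per-GPU gradients add up to the full-batch gradient
# when the loss is a sum over examples (illustrative only).

def grad(w, examples):
    """d/dw of sum((w*x - y)^2) over the given examples."""
    return sum(2.0 * (w * x - y) * x for x, y in examples)

batch = [(1.0, 3.0), (2.0, 5.0), (3.0, 7.0), (4.0, 9.0)]
w = 0.5

full = grad(w, batch)                              # single-device gradient
per_gpu = [grad(w, batch[:2]), grad(w, batch[2:])] # two simulated GPUs
combined = sum(per_gpu)                            # what the all-reduce produces

print(full, combined)
```

The two numbers agree because the derivative of a sum is the sum of the derivatives. If the loss is a mean rather than a sum, each replica's loss has to be divided by the global batch size (not its local batch size) for the summed gradients to match; in TensorFlow custom training loops, tf.nn.compute_average_loss is the helper intended for that scaling.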