Weights update on multi GPU mirrored strategy

Dror · October 15, 2022, 9:43am

Hi,
I’m confused about how weights are updated with the multi-GPU mirrored strategy (e.g., the code for C2_W4_Lab2). I’d appreciate it if someone could explain this:
I understand that on each training step the gradients are calculated for the current set of weights, and changes are made to model.trainable_variables. But if training is performed in parallel on multiple GPUs, how is this training synchronized?
If one GPU makes changes to model.trainable_variables, how can the other GPUs calculate their gradients at the same time without interfering with the training?

gent.spah · October 17, 2022, 8:17am

As far as I remember, at the end of each training step the gradients from each GPU are aggregated.

Wendy · October 17, 2022, 8:01pm

Exactly. During training, the mirrored strategy uses an all-reduce algorithm to aggregate the gradients from each device and broadcast the result back to every device, which keeps the copies of the weights in sync.
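
To make the synchronization concrete, here is a minimal toy simulation in plain Python. It is only a sketch of the idea, not TensorFlow's actual implementation: the one-parameter model, the helper names (`local_gradient`, `all_reduce_sum`, `train_step`), and the data are all made up for illustration. Each "replica" holds its own copy of the weight, computes a gradient on its own shard of the batch, and only then do all replicas apply the same summed gradient, so no replica ever sees weights that another replica has already changed.

```python
# Toy simulation of synchronous data-parallel training (no TensorFlow needed).
# Hypothetical model: y_hat = w * x, with per-example loss (w * x - y) ** 2.

def local_gradient(w, shard, global_batch_size):
    """Gradient of the squared error summed over one replica's shard,
    scaled by the GLOBAL batch size (not the shard size)."""
    return sum(2.0 * (w * x - y) * x for x, y in shard) / global_batch_size

def all_reduce_sum(values):
    """Stand-in for all-reduce: every replica ends up with the same total."""
    total = sum(values)
    return [total for _ in values]

def train_step(replica_weights, shards, lr):
    global_batch_size = sum(len(shard) for shard in shards)
    # 1. Every replica computes a gradient from its own copy of the weight.
    local_grads = [
        local_gradient(w, shard, global_batch_size)
        for w, shard in zip(replica_weights, shards)
    ]
    # 2. The local gradients are summed and the sum is shared with all replicas.
    reduced = all_reduce_sum(local_grads)
    # 3. Every replica applies the identical update, so the copies stay in sync.
    return [w - lr * g for w, g in zip(replica_weights, reduced)]

# Four examples of y = 2 * x, split across two simulated GPUs.
batch = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
shards = [batch[:2], batch[2:]]
replica_weights = [0.0, 0.0]  # mirrored: same initial value on every replica

for _ in range(50):
    replica_weights = train_step(replica_weights, shards, lr=0.05)

print(replica_weights)
```

The weights are only modified in step 3, after every replica has finished computing its gradient, which is why the replicas do not interfere with each other.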

@Dror, there’s a good general overview of how parallelism works for the various strategies in the lecture “Types of distribution strategies”.
There’s also a little more specific info in the TensorFlow documentation for MirroredStrategy.

Dror · October 18, 2022, 5:00am

Hi,
Thank you for your reply.
I think I got it by now.
What I wasn’t sure about was how the gradients from the different GPUs are merged to update the weights coherently. However, if the total gradient is the sum of the gradients obtained for all training examples, then this makes sense.
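
Written out, the reasoning is roughly as follows. This is a sketch under an assumption: the loss is the mean of per-example losses over the global batch, so each replica scales its sum by the global batch size. The symbols ($B$ for the global batch, $B_k$ for the shard on GPU $k$, $K$ for the number of GPUs, $\ell_i$ for the loss on example $i$, $g_k$ for the gradient computed on GPU $k$) are notation introduced here, not taken from the lab.

```latex
\begin{aligned}
L(w) &= \frac{1}{|B|} \sum_{i \in B} \ell_i(w),
\qquad B = B_1 \cup \dots \cup B_K \ \text{(disjoint shards)} \\
\nabla L(w) &= \frac{1}{|B|} \sum_{k=1}^{K} \sum_{i \in B_k} \nabla \ell_i(w)
 = \sum_{k=1}^{K} g_k,
\qquad g_k = \frac{1}{|B|} \sum_{i \in B_k} \nabla \ell_i(w)
\end{aligned}
```

Since every $g_k$ is evaluated at the same weights $w$, summing them across GPUs gives the same gradient a single device would compute on the whole batch.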
