I have a few questions about the content of the lecture~

Hi~
There are a lot of things I don't understand, so if you know anything about them, please let me know.

Q1) In Distributed Data Parallel, how are the model's gradients on each GPU synchronized?
Is it by calculating the mean of the gradients from each GPU? I'd like to understand how gradient synchronization actually works.
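To make my guess concrete, here is a minimal sketch of what I imagine happening, written with torch.distributed; this is just my assumption, not necessarily what DDP really does internally:

```python
import torch
import torch.distributed as dist

def sync_gradients(model: torch.nn.Module) -> None:
    """Average each parameter's gradient across all GPUs with an all-reduce."""
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            # Sum the gradient tensors contributed by every rank...
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            # ...then divide by the number of GPUs to get the mean gradient.
            param.grad /= world_size
```

Is this roughly what happens under the hood, or does DDP do something different?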

Q2) In Fully Sharded Data Parallel, what does "getting weights" mean?
I understand that the reason for getting the weights is to execute an operation, but the forward pass already happens after the weights are gathered. How is the operation executed after getting the weights different from the forward pass or the backward pass?
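For reference, this is roughly how I picture "getting weights": each GPU holds only a shard of a parameter and all-gathers the other shards before computing with it. This is just my assumption sketched with torch.distributed.all_gather; I don't know whether FSDP really does it this way:

```python
import torch
import torch.distributed as dist

def gather_full_weight(local_shard: torch.Tensor) -> torch.Tensor:
    """Reassemble a full weight tensor from the shard each GPU holds."""
    world_size = dist.get_world_size()
    shards = [torch.empty_like(local_shard) for _ in range(world_size)]
    # Every rank contributes its shard and receives everyone else's.
    dist.all_gather(shards, local_shard)
    # Concatenate the shards back into the full parameter so it can be
    # used in the forward (or backward) computation.
    return torch.cat(shards, dim=0)
```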

Q3) Are the ZeRO stages sequential steps of one overall process, or are they options, so users can choose any one of the three?

Q4) Does the number of tokens literally mean the total number of tokens in the model's vocabulary?

Hello @Nadle,
I don't have all the answers, but I'll try to address one of them, namely Q4.

Q4) Does the number of tokens literally mean the total number of tokens in the model's vocabulary?

In my understanding, the number of tokens here refers to the size of the training dataset. As you can see below, Chinchilla recommends roughly 20x as many training tokens as model parameters. In the green box we can see numbers that match pretty well (LLaMA-65B), but the last three models have too many parameters compared to the number of tokens in their training datasets (or, put the other way, too few training tokens for their parameter counts).
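To put rough numbers on it, here is a back-of-the-envelope check of the ~20 tokens-per-parameter rule of thumb, using the ~1.4T training tokens reported for LLaMA-65B (a sketch, not an exact reproduction of the Chinchilla fit):

```python
# Chinchilla-style rule of thumb: recommended training tokens ~= 20 x parameters.
def recommended_tokens(num_params: float, tokens_per_param: float = 20.0) -> float:
    return tokens_per_param * num_params

llama_65b_params = 65e9           # LLaMA-65B parameter count
llama_65b_train_tokens = 1.4e12   # tokens it was actually trained on (~1.4T)

rec = recommended_tokens(llama_65b_params)  # 1.3e12, i.e. ~1.3T tokens
print(f"recommended: {rec:.2e}, actual: {llama_65b_train_tokens:.2e}")
# ~1.3T recommended vs ~1.4T actual -> this is why LLaMA-65B "matches pretty well".
```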

I got it~
Thank you for your kind explanation!

My pleasure @Nadle :upside_down_face: