I have a few questions about the content of the lecture~

Hi~
There are a lot of things I don't understand, so if you know anything about them, please let me know.

Q1) In Distributed Data Parallel, how are the model's gradients on each GPU synchronized?
Is it by calculating the mean of the gradients from each GPU? I'd like to understand how gradient synchronization actually works.
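To make my guess concrete, here is a minimal sketch of what I imagine happening, written with torch.distributed; this is just my assumption, not necessarily what DDP really does internally:

```python
import torch
import torch.distributed as dist

def sync_gradients(model: torch.nn.Module) -> None:
    """Average each parameter's gradient across all GPUs with an all-reduce."""
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            # Sum the gradient tensors contributed by every rank...
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            # ...then divide by the number of GPUs to get the mean gradient.
            param.grad /= world_size
```

Is this roughly what happens under the hood, or does DDP do something different?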

Q2) In Fully Sharded Data Parallel, what does "getting weights" mean?
I understand that the reason for getting the weights is to execute an operation, but the forward pass already happens after the weights are gathered. How is the operation executed after getting the weights different from the forward pass or the backward pass?
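For reference, this is roughly how I picture "getting weights": each GPU holds only a shard of a parameter and all-gathers the other shards before computing with it. This is just my assumption sketched with torch.distributed.all_gather; I don't know whether FSDP really does it this way:

```python
import torch
import torch.distributed as dist

def gather_full_weight(local_shard: torch.Tensor) -> torch.Tensor:
    """Reassemble a full weight tensor from the shard each GPU holds."""
    world_size = dist.get_world_size()
    shards = [torch.empty_like(local_shard) for _ in range(world_size)]
    # Every rank contributes its shard and receives everyone else's.
    dist.all_gather(shards, local_shard)
    # Concatenate the shards back into the full parameter so it can be
    # used in the forward (or backward) computation.
    return torch.cat(shards, dim=0)
```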

Q3) Are the ZeRO stages sequential steps of one overall process, or are they options, so users can choose any one of the three?

Q4) Does the number of tokens literally mean the total number of tokens in the model's vocabulary?

Hello @Nadle,
I don't have all the answers, but I'll try to address one of them, namely Q4.

Q4) Does the number of tokens literally mean the total number of tokens in the model's vocabulary?

In my understanding, the number of tokens here refers to the size of the training dataset. As you can see below, Chinchilla recommends roughly 20x as many training tokens as model parameters. In the green box we can see numbers that match pretty well (LLaMA-65B), but the last three models have too many parameters compared to the number of tokens in their training datasets (or, put the other way, too few training tokens for their parameter counts).
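To put rough numbers on it, here is a back-of-the-envelope check of the ~20 tokens-per-parameter rule of thumb, using the ~1.4T training tokens reported for LLaMA-65B (a sketch, not an exact reproduction of the Chinchilla fit):

```python
# Chinchilla-style rule of thumb: recommended training tokens ~= 20 x parameters.
def recommended_tokens(num_params: float, tokens_per_param: float = 20.0) -> float:
    return tokens_per_param * num_params

llama_65b_params = 65e9           # LLaMA-65B parameter count
llama_65b_train_tokens = 1.4e12   # tokens it was actually trained on (~1.4T)

rec = recommended_tokens(llama_65b_params)  # 1.3e12, i.e. ~1.3T tokens
print(f"recommended: {rec:.2e}, actual: {llama_65b_train_tokens:.2e}")
# ~1.3T recommended vs ~1.4T actual -> this is why LLaMA-65B "matches pretty well".
```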

I got it~
Thank you for your kind explanation!

My pleasure @Nadle :upside_down_face: