Questions about "GPU RAM size needed to train 1B parameters"

Hi. I’m taking the course ‘Generative AI with LLMs’, week 1, “Computational challenges of training LLMs”.

At about 2:00 in the lecture, it says we need 80GB at 32-bit full precision.

But what I understood from 1:30 ~ 2:00 of the lecture is that, for 1 parameter, we need:

  • 4 bytes per parameter / Model Parameters (= weights)
  • 8 bytes per parameter / Adam Optimizers (2 states)
  • 4 bytes per parameter / Gradients
  • 8 bytes per parameter / Activations and temp memory

So during training, for 1 parameter, we need 4 + (8 + 4 + 8) = 24 bytes at most.

As a result, for a 1B-parameter model I calculate:
24 bytes x 1B = 24GB is needed.
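
The arithmetic above can be checked with a minimal Python sketch. The names `BYTES_PER_PARAM` and `training_memory_gb` are made up for illustration, the per-parameter byte counts are the ones listed in this post, and 1 GB is assumed to mean 10**9 bytes:

```python
# Bytes per parameter at 32-bit precision, as listed in the post above.
# The activations figure is the lecture's rough number, not a measured value.
BYTES_PER_PARAM = {
    "weights": 4,
    "adam_states": 8,       # two optimizer states, 4 bytes each
    "gradients": 4,
    "activations_temp": 8,
}


def training_memory_gb(num_params: int) -> float:
    """Estimate training memory in GB, assuming 1 GB = 10**9 bytes."""
    total_bytes = num_params * sum(BYTES_PER_PARAM.values())
    return total_bytes / 1e9


print(sum(BYTES_PER_PARAM.values()))      # 24 bytes per parameter
print(training_memory_gb(1_000_000_000))  # 24.0 GB for a 1B-parameter model
```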

But at 2:00 in the lecture, it says we need 80GB to train a 1B model…

So… what am I missing?
Can anyone correct me?

Thanks in advance :slight_smile:

Hi @kjscop, thank you for reporting this. We have noted the issue and passed it to the team in charge of content. It is being reviewed.

Hey, I have the same doubt. How do we get 80 GB at FP32 for 1B model parameters? Please share the math behind this calculation.

I am still waiting on a reply from the group in charge of this :slight_smile:

Thank you for updating. :slight_smile:

I believe the 24 GB refers to the memory needed to hold the pieces required to run the LLM, which doesn’t include the memory needed to train it. More memory is needed to train.

This is how I understood it.

By “more memory”, do you mean the memory taken up by the training samples in a batch?

It sounds like the extra components of training can easily lead to 20x the amount that the weights alone take up.

I’m not sure of the details on what exactly is taking up all this memory. Still learning :slight_smile:

SOURCE
Computational challenges of training LLMs
Minute 1:21

"If you want to train the model, you’ll have to plan for additional components that use
GPU memory during training. These include two Adam optimizer states,
gradients, activations, and temporary variables needed by your functions.

This can easily lead to 20 extra bytes of memory per model parameter.
In fact, to account for all of these overhead during training,
you’ll actually require approximately 20 times the amount
of GPU RAM that the model weights alone take up."
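
The quote gives two different figures: "20 extra bytes" per parameter and "20 times" the weights. A small Python sketch shows what each reading gives for a 1B-parameter model. The variable names are made up for illustration, and 1 GB is assumed to be 10**9 bytes. This only compares the two readings; it does not say which one the lecture intended.

```python
WEIGHT_BYTES = 4            # one FP32 weight
EXTRA_BYTES = 20            # "20 extra bytes of memory per model parameter"
NUM_PARAMS = 1_000_000_000  # 1B-parameter model

weights_gb = NUM_PARAMS * WEIGHT_BYTES / 1e9

# Reading 1: weights plus 20 extra bytes per parameter.
additive_gb = NUM_PARAMS * (WEIGHT_BYTES + EXTRA_BYTES) / 1e9

# Reading 2: 20 times the memory of the weights alone.
multiplicative_gb = 20 * weights_gb

# Under reading 1, training memory relative to the weights alone.
ratio = (WEIGHT_BYTES + EXTRA_BYTES) / WEIGHT_BYTES

print(weights_gb)         # 4.0
print(additive_gb)        # 24.0
print(multiplicative_gb)  # 80.0
print(ratio)              # 6.0
```

So "20 extra bytes" works out to 24 GB (about 6 times the weights), while "20 times" works out to the 80 GB figure.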

Hope this helps.

Actually, 24 GB should be the right figure. See this video on the DeepLearning.AI YouTube channel: Efficient Fine-Tuning for Llama-v2-7b on a Single GPU.
They take parameters, gradients, and optimizer states into account, but not activations, and get 112 GB in total for 7 billion parameters (16 bytes per parameter).

If you take all four (parameters, gradients, optimizer states, and activations) into account, you get 168GB for 7 billion parameters (24 bytes per parameter).

So for 1 billion parameters it should be 24 GB.
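
As a rough check of these numbers, here is a short Python sketch. The function name `estimate_gb` and its `include_activations` flag are hypothetical, the byte counts are the FP32 figures used in this thread, and 1 GB is assumed to be 10**9 bytes:

```python
def estimate_gb(num_params: int, include_activations: bool) -> float:
    """Estimate FP32 training memory in GB (1 GB = 10**9 bytes)."""
    # 4 bytes weights + 4 bytes gradients + 8 bytes for two Adam states
    bytes_per_param = 4 + 4 + 8
    if include_activations:
        # 8 bytes is the rough per-parameter figure quoted in this thread
        bytes_per_param += 8
    return num_params * bytes_per_param / 1e9


print(estimate_gb(7_000_000_000, include_activations=False))  # 112.0
print(estimate_gb(7_000_000_000, include_activations=True))   # 168.0
print(estimate_gb(1_000_000_000, include_activations=True))   # 24.0
```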