I believe the 24 GB refers to the memory needed to hold the necessary pieces to run the LLM for inference, which doesn’t include the memory needed to train it. Training requires more memory.
It sounds like the extra components of training can easily add about 20 extra bytes of memory per parameter, roughly 6x what the weights alone take up.
I’m not sure of the details on what exactly takes up all this memory. Still learning.
SOURCE: Computational challenges of training LLMs (video), minute 1:21:
"If you want to train the model, you’ll have to plan for additional components that use
GPU memory during training. These include two Adam optimizer states,
gradients, activations, and temporary variables needed by your functions.
This can easily lead to 20 extra bytes of memory per model parameter.
In fact, to account for all of these overhead during training,
you’ll actually require approximately 20 times the amount
of GPU RAM that the model weights alone take up."
Actually, 24 GB should be the right figure. See this video posted on the DeepLearning.AI YouTube channel: Efficient Fine-Tuning for Llama-v2-7b on a Single GPU.
They take parameters, gradients, and optimizer states into account, but not activations, and get 112 GB in total for 7 billion parameters.
If you take all four (parameters, gradients, optimizer states, and activations) into account, you should get 168 GB for 7 billion parameters.
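The same sketch reproduces both 7B totals (again assuming fp32 throughout, 1 GB = 10^9 bytes, and the same illustrative 8-bytes-per-parameter activation allowance as above):

```python
n_params = 7_000_000_000  # Llama-v2-7b

params_gb    = n_params * 4 / 1e9  # 28 GB, fp32 weights
grads_gb     = n_params * 4 / 1e9  # 28 GB, fp32 gradients
optimizer_gb = n_params * 8 / 1e9  # 56 GB, two fp32 Adam states
activs_gb    = n_params * 8 / 1e9  # 56 GB, rough allowance (assumption)

print(params_gb + grads_gb + optimizer_gb)              # -> 112.0 GB
print(params_gb + grads_gb + optimizer_gb + activs_gb)  # -> 168.0 GB
```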