Can anyone explain the relationship between batch size, context window, and the amount of memory needed for a model? I can see that the higher the batch size and the longer the context window, the more memory is needed, but is there a calculation that approximates the impact?
I tried searching and asking LLMs, but I wasn't successful in getting an answer - or at least in fully understanding what I got.
Let's say a 7B model, but is there a relationship that gives a good estimate and takes into account the various options available? Does anything change if it's only inference rather than training?
For inference, the weights are the baseline: multiply the parameter count by the bytes per parameter (about 14 GB for a 7B model in FP16, half that in 8-bit). On top of that comes the KV cache, which is the part that scales with batch size and context length: roughly 2 × layers × batch_size × seq_len × hidden_size × bytes per element.
For training, you'll need at least double the memory for the trainable parameters, because a gradient is stored for every weight; optimizers like Adam keep additional state on top of that. The activations saved for backpropagation also scale with batch size × sequence length.
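A rough back-of-the-envelope sketch of that, assuming a LLaMA-style 7B config (32 layers, 4096 hidden size) and FP16 everywhere; the exact numbers depend on the model config, quantization, and framework overhead:

```python
def estimate_memory_gb(
    n_params=7e9,        # total parameters (7B assumed)
    n_layers=32,         # transformer layers (LLaMA-7B-like assumption)
    hidden_size=4096,    # model dimension (LLaMA-7B-like assumption)
    batch_size=1,
    seq_len=4096,        # context window in tokens
    bytes_per_elem=2,    # FP16/BF16; use 1 for int8, 0.5 for 4-bit
    training=False,
):
    GB = 1024 ** 3

    # Weights: independent of batch size and context length.
    weights = n_params * bytes_per_elem

    # KV cache: K and V tensors per layer, each batch x seq x hidden.
    kv_cache = 2 * n_layers * batch_size * seq_len * hidden_size * bytes_per_elem

    total = weights + kv_cache
    if training:
        # Gradients (~1x weights) plus Adam optimizer state (~2x weights).
        total += 3 * n_params * bytes_per_elem
        # Very rough activation estimate; ignores attention matrices,
        # gradient checkpointing, etc.
        total += 10 * n_layers * batch_size * seq_len * hidden_size * bytes_per_elem

    return total / GB


# Example: 7B model, 4K context, FP16 inference.
print(f"{estimate_memory_gb():.1f} GB")               # ~17 GB at batch 1
print(f"{estimate_memory_gb(batch_size=8):.1f} GB")   # KV cache grows with batch
```

The point is that the weight term is constant while the KV-cache and activation terms grow roughly linearly with batch_size × seq_len, which is why large batches and long contexts blow up memory even when the model itself fits comfortably.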