Relationship between batch size and GPU memory

Hi. I’m attending the course ‘Generative AI with LLMs’, Week 1 - “Computational challenges of training LLMs”. The following calculation is clear, but I wonder: is this a pure model-memory calculation, i.e. training on just 1 sample? What is the impact of different batch sizes on GPU memory? What are the specific impacts?

My thinking is that the intermediate results in each layer need to be stored at least temporarily, and that part should be proportional to the batch size. Can anyone show this with a specific example? Thank you.


Hello @liangyi

That is such a good question. I had the same thought when GPU usage was affecting my model training, but I couldn’t find a fully worked-out analysis relating parameters to bytes of GPU memory. Today, though, I found someone who did some digging.

Give me some more time; if I find more relevant material on this, I will share it.

Regards
DP


:grinning: haha thank you. There are a lot of guides on calculating GPU memory from the number of parameters, but much less information about batch size. I think most people just adjust it intuitively when they see OOM :joy:

On batch_size there is some information, but parameter-wise there isn’t much.

See this article, which explains how to track GPU usage.
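For example (my own quick sketch, assuming PyTorch on a CUDA device, not necessarily what the article uses), you can snapshot GPU memory from inside a training script like this, or just watch nvidia-smi from the shell:

```python
# Minimal sketch (assumes PyTorch + CUDA) for checking GPU memory from a training script.
import torch

def report_gpu_memory(tag=""):
    # Currently allocated tensor memory vs. the peak since the last reset
    allocated = torch.cuda.memory_allocated() / 1024**2
    peak = torch.cuda.max_memory_allocated() / 1024**2
    print(f"{tag} allocated: {allocated:.1f} MiB, peak: {peak:.1f} MiB")

torch.cuda.reset_peak_memory_stats()
# ... run a forward/backward pass here ...
report_gpu_memory("after one step")
```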

But your query gave me an idea: doing a parameter-wise analysis of GPU memory usage.

Thanks and regards
DP


OK, thank you :smile_cat: I’ll do some tests as well.

My question was related to this image, Paul @paulinpaloalto

If the question is how to add the “dimension” of batch size to the above memory usage computations, then it requires a bit more thought. We have to consider what happens in the various intermediate calculations that we do in forward and back prop.

For the parameters themselves, there is no change, right? Because the size of the parameters is not affected by the batch size.

The final gradients themselves are also the same size as the parameters, of course. So those are not affected.

But then we have to think about all the intermediate steps (linear and non-linear activations in forward prop) and all the Chain Rule formulas in back prop.

All the forward propagation calculations involve the minibatch, e.g.

Z^{[l]} = W^{[l]} \cdot A^{[l-1]} + b^{[l]}
A^{[l]} = g^{[l]}(Z^{[l]})

So all the A and Z values there have dimensions (number of neurons) × (number of samples in the batch).
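To put rough numbers on that (a back-of-the-envelope sketch of my own, assuming fp32 activations and ignoring framework overhead), the Z and A for a single layer already scale linearly with the batch size:

```python
# Rough sketch: activation memory per layer ~ neurons * batch_size * bytes_per_value.
# Assumes fp32 (4 bytes) and ignores framework overhead; layer size is made up.
bytes_per_value = 4  # fp32

def activation_bytes(neurons, batch_size):
    # Both Z[l] and A[l] have shape (neurons, batch_size), so count them twice
    return 2 * neurons * batch_size * bytes_per_value

for m in (1, 8, 64):
    mib = activation_bytes(neurons=4096, batch_size=m) / 1024**2
    print(f"batch_size={m:3d}: ~{mib:.1f} MiB for one layer's Z and A")
```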

And some of the back prop formulas also are affected by the batch size, e.g.:

dZ^{[L]} = A^{[L]} - Y
dW^{[L]} = \displaystyle \frac {1}{m} dZ^{[L]} \cdot A^{[L-1]T}

So the intermediate values there will be (neurons) × (samples), but those are just temporary values that can be discarded.
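To make the shapes concrete (a toy check with made-up layer sizes, not anything from the course materials): dZ grows with the batch size m, while dW comes out the same size as W regardless of m.

```python
# Toy shape check: dZ scales with batch size m, dW does not.
import numpy as np

n_L, n_prev = 10, 20                    # neurons in layers L and L-1 (made-up sizes)
for m in (1, 32):
    dZ = np.random.randn(n_L, m)        # shape (neurons, batch) -> scales with m
    A_prev = np.random.randn(n_prev, m) # activations from layer L-1
    dW = (1 / m) * dZ @ A_prev.T        # shape (n_L, n_prev) -> independent of m
    print(f"m={m:2d}: dZ {dZ.shape}, dW {dW.shape}")
```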

So basically this makes my head hurt. :grin: The simpler approach would be to evaluate this experimentally. It looks like Deepti has found some really valuable info about how to monitor or get the status of the GPU. So you could just turn on all the GPU statistics gathering or add that to your training scripts and then try running first with minibatch size = 1 (Stochastic GD) to get the baseline memory usage, which should correspond to your chart above. Then run again with batch size = 2 or 4 or 8 and see how much the memory usage increases. With that info, you can then estimate what the maximum batch size is that you can use without getting the dreaded OOM errors given the size of the memory on your particular GPU.
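Here is roughly what I have in mind (a sketch only, assuming PyTorch and a small stand-in MLP rather than the actual LLM from the course):

```python
# Sketch of the experiment: peak GPU memory for one training step at several batch sizes.
# Assumes PyTorch with CUDA; the toy MLP is a stand-in, not the course's model.
import torch
import torch.nn as nn

device = "cuda"
model = nn.Sequential(nn.Linear(4096, 4096), nn.ReLU(), nn.Linear(4096, 10)).to(device)
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters())

for batch_size in (1, 2, 4, 8):
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()
    x = torch.randn(batch_size, 4096, device=device)
    y = torch.randint(0, 10, (batch_size,), device=device)
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    peak = torch.cuda.max_memory_allocated() / 1024**2
    print(f"batch_size={batch_size}: peak {peak:.1f} MiB")
```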

I have not actually read the articles pointed to above, but does what I’m suggesting there sound like it would be doable?


Thank you :grinning: from Guangzhou China
