If the question is how to add the “dimension” of batch size to the above memory usage computations, then it requires a bit more thought. We have to consider what happens in the various intermediate calculations that we do in forward and back prop.

For the parameters themselves, there is no change, right? Because the size of the parameters is not affected by the batch size.

The final gradients themselves are also the same size as the parameters, of course. So those are not affected.

But then we have to think about all the intermediate steps (linear and non-linear activations in forward prop) and all the Chain Rule formulas in back prop.

All the forward propagation calculations involve the minibatch, e.g.

Z^{[l]} = W^{[l]} \cdot A^{[l-1]} + b^{[l]}

A^{[l]} = g^{[l]}(Z^{[l]})

So all the A and Z values there are of size number of neurons times m, where m is the number of samples in the minibatch.
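To make that scaling concrete, here's a minimal sketch (pure Python, with hypothetical layer sizes and a 4-bytes-per-float32 assumption) that tallies the forward prop activation memory as a function of the batch size m:

```python
def activation_bytes(layer_sizes, m, bytes_per_float=4):
    """Rough estimate of forward prop activation storage.

    layer_sizes: [n0, n1, ..., nL], where n0 is the number of input
    features. Each layer l stores Z[l] and A[l], both of shape
    (n[l], m), so this total grows linearly with batch size m."""
    total = 0
    for n in layer_sizes[1:]:
        total += 2 * n * m * bytes_per_float  # one Z and one A per layer
    return total

# Hypothetical network: 1000 inputs, two hidden layers, 10 outputs
sizes = [1000, 512, 256, 10]
print(activation_bytes(sizes, m=1))   # 6224 bytes at m = 1
print(activation_bytes(sizes, m=64))  # exactly 64x the m = 1 number
```

The point of the sketch is just that the activation term is linear in m, so doubling the batch size doubles this component of the memory footprint.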

And some of the back prop formulas are also affected by the batch size, e.g.:

dZ^{[L]} = A^{[L]} - Y

dW^{[L]} = \displaystyle \frac {1}{m} dZ^{[L]} \cdot A^{[L-1]T}

So the intermediate values there (the dZ terms) will be neurons times samples, but those are just temporary values that can be discarded once the gradients for that layer have been computed.
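To illustrate the distinction, here's a quick sketch (pure Python, same hypothetical layer sizes and float32 assumption as before) separating the temporary back prop storage, which scales with m, from the gradient storage, which matches the parameter sizes and doesn't:

```python
BYTES_PER_FLOAT = 4  # assuming float32

def backprop_temp_bytes(layer_sizes, m):
    # The dZ values: one (n[l], m) matrix per layer. These are
    # temporaries, discardable after that layer's gradients are done.
    return sum(n * m * BYTES_PER_FLOAT for n in layer_sizes[1:])

def gradient_bytes(layer_sizes):
    # dW[l] is (n[l], n[l-1]) and db[l] is (n[l], 1): exactly the
    # same sizes as the parameters themselves, independent of m.
    return sum((n * n_prev + n) * BYTES_PER_FLOAT
               for n_prev, n in zip(layer_sizes, layer_sizes[1:]))

sizes = [1000, 512, 256, 10]  # hypothetical network
print(backprop_temp_bytes(sizes, 64))  # grows linearly with m
print(gradient_bytes(sizes))           # same value for any m
```

So the batch size drives the dZ temporaries (and the cached A values from forward prop), while the dW and db storage stays fixed.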

So basically this makes my head hurt. The simpler approach would be to evaluate this experimentally. It looks like Deepti has found some really valuable info about how to monitor or get the status of the GPU. So you could turn on all the GPU statistics gathering, or add that to your training scripts, and then run first with minibatch size = 1 (Stochastic GD) to get the baseline memory usage, which should correspond to your chart above.

Then run again with batch size = 2 or 4 or 8 and see how much the memory usage increases. With that info, you can estimate the maximum batch size you can use without getting the dreaded OOM errors, given the size of the memory on your particular GPU.
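Once you have a couple of those measurements, turning them into a maximum batch size estimate is just a linear extrapolation. Here's a sketch of that arithmetic (the function name, the headroom factor, and the example numbers are all my inventions; plug in what you actually measure):

```python
def max_batch_size(mem_m1, mem_m2, m1, m2, gpu_capacity, headroom=0.9):
    """Fit memory(m) = fixed + per_sample * m through two measured
    (batch size, memory) points, then solve for the largest m that
    stays under headroom * gpu_capacity. The headroom factor leaves
    a safety margin against OOM from fragmentation and overhead."""
    per_sample = (mem_m2 - mem_m1) / (m2 - m1)
    fixed = mem_m1 - per_sample * m1
    budget = headroom * gpu_capacity
    return int((budget - fixed) // per_sample)

# Hypothetical measurements: 1.2 GB used at m = 1, 1.5 GB at m = 8,
# on a GPU with 11 GB of memory
print(max_batch_size(1.2e9, 1.5e9, 1, 8, 11e9))
```

The linear model is only an approximation (frameworks cache and fragment memory in ways this ignores), so treat the answer as a starting point and verify with an actual run.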

I have not actually read the articles pointed to above, but does what I'm suggesting here sound like it would be doable?