Isn’t the advantage of mini-batch gradient descent that it speeds up the training process? In that case, why are we trying to “tune” it by trying various values (which will further increase the time to train)? How does the model learn different parameters with different minibatch sizes?

In the course it is said that when we use minibatch size = 1 we lose the advantage of vectorization. In that case, shouldn’t there be a maximum minibatch size that depends on the maximum parallel processing that can be achieved?

The point of minibatch gradient descent is that you get to update the parameters more often per pass over the training set, so you should get faster convergence if you choose the minibatch size appropriately. That is why the size is worth tuning: the time spent trying a few values can be repaid by faster convergence afterwards. At one extreme you get SGD (minibatch size = 1), which has the disadvantage you mention: you lose the benefits of vectorization. At the other extreme you get minibatch size = the full training set, which gives you maximum vectorization, but you lose the benefit of more frequent updates to the parameters, since you get only one update per epoch. The real point is that you hope there is a “Goldilocks” value somewhere in the middle that gives you something close to optimal convergence. As Yann LeCun famously said: “Friends don’t let friends use minibatch sizes > 32”.
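To make the tradeoff concrete, here is a minimal sketch, not anything from the course. It fits a toy 1-D linear model with minibatch sizes 1, 32, and the full training set, and counts parameter updates. The synthetic data, the learning rate of 0.1, the 5 epochs, and the names `make_data`, `mse`, and `train` are all assumptions chosen for illustration. It is pure Python, so it shows only the "more frequent updates" side; it does not measure the vectorization speedup you would get from NumPy or a GPU.

```python
import random


def make_data(m=256, seed=0):
    """Synthetic data (an assumption for this sketch): y = 3x + 0.5 + noise."""
    rng = random.Random(seed)
    xs = [rng.uniform(-1.0, 1.0) for _ in range(m)]
    ys = [3.0 * x + 0.5 + rng.gauss(0.0, 0.1) for x in xs]
    return xs, ys


def mse(w, b, xs, ys):
    """Mean squared error of the model w*x + b over the whole data set."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def train(xs, ys, batch_size, epochs=5, lr=0.1, seed=0):
    """Minibatch gradient descent; returns (w, b, number of updates)."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    m = len(xs)
    idx = list(range(m))
    updates = 0
    for _ in range(epochs):
        rng.shuffle(idx)  # new random minibatches every epoch
        for start in range(0, m, batch_size):
            batch = idx[start:start + batch_size]
            n = len(batch)
            # Gradient of the mean squared error over this minibatch only.
            grad_w = sum(2 * (w * xs[i] + b - ys[i]) * xs[i] for i in batch) / n
            grad_b = sum(2 * (w * xs[i] + b - ys[i]) for i in batch) / n
            w -= lr * grad_w
            b -= lr * grad_b
            updates += 1  # one parameter update per minibatch
    return w, b, updates


xs, ys = make_data()
results = {}
for bs in (1, 32, len(xs)):
    w, b, updates = train(xs, ys, bs)
    results[bs] = (updates, mse(w, b, xs, ys))
    print(f"batch_size={bs:3d}  updates={updates:4d}  final_mse={results[bs][1]:.4f}")
```

With the same number of epochs, the smaller minibatch sizes get many more parameter updates. In a real framework each of those updates also costs more wall-clock time per example when the minibatch is tiny, which is why you search for a value in the middle rather than simply picking 1.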
