What is the main benefit of minibatch size

In the course, Andrew mentioned that using mini-batches increases efficiency, because we don’t need to go over all the training data to get one update. On the other hand, if the mini-batch size is 1, we lose the benefit of vectorization. From this, it seems like the main benefit of mini-batches is efficiency, not learning capacity (unlike increasing the number of neurons, which may allow the network to learn more features). He also mentioned that mini-batch size is a hyperparameter to tune.

My question is: to tune a hyperparameter, we have to train the model with different values of that hyperparameter and then compare. Suppose we are choosing among mini-batch sizes 8, 16, 32, and 64, each taking time t8, t16, t32, t64 to converge, with t64 < t32 < t16 < t8. But to find this out, we would have to train the model with all four sizes, which means spending t8 + t16 + t32 + t64 > t64. So even if mini-batch size 64 is the most efficient for the model, the tuning process itself isn’t efficient.

How does this work?

Hi Sara, in this case, to tune the mini-batch size you effectively have to choose a size, run the model, and see how it performs.
There are several ways to do it. For example, you can launch several models in parallel with different mini-batch sizes and monitor how the loss decreases. Or you can launch one model, let it run for just a few epochs, then run it again several times with other mini-batch sizes for a few epochs each, compare how the loss curves behaved, and select the best mini-batch size for the full run. The key point is that you don’t need each candidate to run all the way to convergence, so the tuning cost is much less than t8 + t16 + t32 + t64.
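
The "short runs, then compare" strategy above can be sketched on a toy problem. This is just an illustration (a tiny linear model trained with numpy, with made-up sizes and learning rate), not anything from the course:

```python
import numpy as np

# Hypothetical tuning sketch: train the same tiny linear model for a few
# epochs at several mini-batch sizes and compare the resulting losses.
rng = np.random.default_rng(0)
X = rng.normal(size=(1024, 10))
true_w = rng.normal(size=10)
y = X @ true_w + 0.1 * rng.normal(size=1024)

def short_run(batch_size, epochs=5, lr=0.05):
    """A few epochs of mini-batch gradient descent; returns final MSE."""
    w = np.zeros(10)
    n = len(X)
    for _ in range(epochs):
        idx = rng.permutation(n)                      # reshuffle each epoch
        for start in range(0, n, batch_size):
            b = idx[start:start + batch_size]
            grad = 2 * X[b].T @ (X[b] @ w - y[b]) / len(b)
            w -= lr * grad                            # one update per batch
    return float(np.mean((X @ w - y) ** 2))

for bs in (8, 16, 32, 64):
    print(f"batch size {bs:3d}: loss after 5 epochs = {short_run(bs):.4f}")
```

Whichever size gives the best loss-versus-time trade-off in these short runs is the one you'd keep for the full training run.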

The mini-batch concept applies mainly to big NNs with a really big number of examples, where it has 2 main advantages:

  1. It does not wait to process all the input examples before starting to update the weights, which can accelerate learning.
  2. A more technical performance consideration: when the NN calculations are performed on a GPU or TPU, you want to load the network and as many examples as possible onto the GPU or TPU to get the performance increase they provide. Depending on the size of the NN and the quantity and size of the examples (think cat photos), it may not be possible to load the full dataset into GPU/TPU memory, so choosing the mini-batch size is critical here.
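
For point 2, a quick back-of-envelope check makes the memory constraint concrete. All the numbers here (224x224 RGB images, float32, an 8 GB accelerator, and counting only the input tensor) are assumptions for illustration:

```python
# Rough memory check: does a mini-batch of images fit on the accelerator?
bytes_per_image = 224 * 224 * 3 * 4      # 224x224 RGB image, float32
gpu_budget = 8 * 1024**3                 # say, an 8 GB GPU

for batch in (32, 64, 1024, 16384):
    mib = batch * bytes_per_image / 1024**2
    fits = "fits" if batch * bytes_per_image < gpu_budget else "too big"
    print(f"batch {batch:5d}: {mib:8.1f} MiB for the input tensor alone ({fits})")
```

In practice activations, weights, and optimizer state also compete for the same memory, so the feasible batch size is even smaller than this estimate suggests.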

I would just like to add to @javier’s answer that the size of a mini-batch also affects the convergence process itself. For small batch sizes learning is noisy, as the loss can jump around unpredictably from step to step. For a larger batch size the learning curve is smoother, which stabilizes learning and makes it easier to tell when the process has converged.
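
The noise effect above comes from the mini-batch gradient being an estimate of the full-batch gradient, and the estimate gets tighter as the batch grows. A small numpy sketch (toy data, illustrative numbers only):

```python
import numpy as np

# Measure how far the mini-batch gradient strays from the full-batch
# gradient, for different batch sizes, on a toy linear-regression loss.
rng = np.random.default_rng(2)
X = rng.normal(size=(4096, 10))
y = X @ rng.normal(size=10) + rng.normal(size=4096)
w = np.zeros(10)

full_grad = 2 * X.T @ (X @ w - y) / len(X)

def mean_grad_error(batch_size, trials=200):
    """Average distance between batch gradient and full gradient."""
    errs = []
    for _ in range(trials):
        b = rng.choice(len(X), size=batch_size, replace=False)
        g = 2 * X[b].T @ (X[b] @ w - y[b]) / batch_size
        errs.append(np.linalg.norm(g - full_grad))
    return float(np.mean(errs))

for bs in (8, 64, 512):
    print(f"batch {bs:3d}: mean gradient error {mean_grad_error(bs):.3f}")
```

Larger batches give a noticeably smaller error, which is exactly why their loss curves look smoother.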