Week 2 Quiz question 2

Great! The point of minibatch GD is that you get faster convergence, at the expense of higher compute cost. Think of the minibatch size as the “knob” that you can turn to modify the performance. At one end of the scale you have full batch gradient descent, and at the other end you have “Stochastic Gradient Descent”, where the minibatch size = 1. So the smaller the minibatch size, the higher the compute cost, but the faster the convergence: you get more parameter updates per epoch, but less benefit from vectorization on each update.

But if you go all the way to the limit of batch size 1, you also get the maximal compute cost (no benefit at all from vectorization) and the maximum amount of statistical noise in the updates: they may bounce all over the place, since each update depends on only one sample and you get no “smoothing” at all from averaging. So the goal is to find the “Goldilocks” point at which you get the fastest convergence at the minimum cost.

Researchers have run some very large and careful studies of this across lots of different systems, and the conclusion is that Yann LeCun had it right in his famous quote: “Friends don’t let friends use minibatches larger than 32”. In almost all cases, the optimal size was somewhere between 1 and 32.
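To make the “knob” concrete, here is a minimal numpy sketch of the update loop (my own illustrative code, not course code; the function name `minibatch_gd` and the linear-regression/MSE loss are just placeholders). Setting `batch_size = m` gives full batch GD, `batch_size = 1` gives SGD, and anything in between is minibatch GD:

```python
import numpy as np

def minibatch_gd(X, y, lr=0.01, batch_size=32, epochs=10, seed=0):
    """Minibatch gradient descent on a linear model with MSE loss.

    batch_size is the "knob": batch_size == X.shape[0] is full batch GD,
    batch_size == 1 is Stochastic Gradient Descent.
    """
    rng = np.random.default_rng(seed)
    m, n = X.shape
    w, b = np.zeros(n), 0.0
    for _ in range(epochs):
        perm = rng.permutation(m)                 # reshuffle each epoch
        for start in range(0, m, batch_size):
            idx = perm[start:start + batch_size]  # indices for this minibatch
            Xb, yb = X[idx], y[idx]
            err = Xb @ w + b - yb                 # predictions minus targets
            w -= lr * (Xb.T @ err) / len(idx)     # gradient averaged over the minibatch
            b -= lr * err.mean()
    return w, b
```

The averaging over `len(idx)` samples is where the “smoothing” comes from: a larger minibatch averages out more of the per-sample noise in each update, at the price of fewer updates per pass through the data.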
