Week 2 Quiz question 2

In the Week 2 quiz, question 2 is the following:

{moderator edit - quiz question and answers removed}

Why is my answer incorrect? And what is the correct answer?

It would be cheating for us to tell you the right answer. But the reason that your answer is wrong is that minibatch gradient descent has more overhead than full batch: there is another “inner” loop over the minibatches, and you get less benefit from vectorization since it is applied to smaller objects. So each “epoch” is actually more expensive in terms of total compute cost. What you hope is that you’ll end up needing fewer total epochs to get good convergence, because the weights get updated after each minibatch.
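To make the “inner loop” concrete, here is a minimal sketch in plain Python. Everything in it is made up for illustration and is not from the course materials: the toy model `y_hat = w * x`, the data, the learning rate, and the function name `run_epoch`. Real code would be vectorized and would normally shuffle the data each epoch.

```python
def run_epoch(xs, ys, w, batch_size, lr=0.05):
    """One pass over the data; returns the updated weight and the update count."""
    m = len(xs)
    updates = 0
    # The "inner" loop over minibatches. With batch_size == m it runs once,
    # which is plain full-batch gradient descent.
    for start in range(0, m, batch_size):
        xb = xs[start:start + batch_size]
        yb = ys[start:start + batch_size]
        # Gradient of the mean squared error for the toy model y_hat = w * x.
        grad = sum(2 * (w * x - y) * x for x, y in zip(xb, yb)) / len(xb)
        w -= lr * grad  # the weight changes after every minibatch
        updates += 1
    return w, updates


xs = [i / 100 for i in range(1, 257)]  # 256 toy examples (hypothetical data)
ys = [3.0 * x for x in xs]             # the true weight is 3.0

w_full, n_full = run_epoch(xs, ys, 0.0, batch_size=len(xs))
w_mini, n_mini = run_epoch(xs, ys, 0.0, batch_size=32)
print(n_full, n_mini)  # 1 update per epoch vs. 8 updates per epoch
```

In this toy setup, one epoch with minibatches of 32 leaves `w` closer to 3.0 than one full-batch epoch does, because it gets 8 updates instead of 1. The price is 8 trips through the loop body on smaller slices, which is the extra overhead described above.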

It’s been a while since I listened to the lectures here, but I would bet that Prof Ng discussed exactly this point that I just made in the lectures. If what I said above didn’t “compute” for you, you might want to go back and scan the transcript of the relevant lectures and see what Prof Ng says on this point.

Thank you for your response; I’m close to grasping the idea.

Great! The point of minibatch GD is that you get faster convergence (fewer epochs), at the expense of a higher compute cost per epoch. Think of the minibatch size as the “knob” that you can turn to modify the performance. At one end of the scale you have full batch gradient descent, where the minibatch size equals the training set size, and at the other end you have “Stochastic Gradient Descent”, where the minibatch size = 1. So the smaller the minibatch size, the higher the compute cost per epoch but, up to a point, the faster the convergence. If you go all the way to the limit of batch size 1, though, you have the maximal compute cost (no benefit at all from vectorization) and the maximum amount of statistical noise in the updates: they may bounce all over the place, since each one depends on only one sample and you get no “smoothing” at all from averaging. So the goal is to find the “Goldilocks” point at which you get the fastest convergence at the minimum cost. Some very large and careful studies have examined this across lots of different systems, and the conclusion is that Yann LeCun had it right in his famous quote: “Friends don’t let friends use minibatch sizes greater than 32”. In almost all cases, the optimal size was somewhere between 1 and 32.
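As a rough illustration of the “knob”, the sketch below shows how the minibatch size determines the number of minibatches (and so the number of weight updates) per epoch. The training set size `m = 1024` and the helper name `minibatches_per_epoch` are hypothetical, chosen only for this example.

```python
import math


def minibatches_per_epoch(m, batch_size):
    """Number of minibatches needed to cover m examples once."""
    # The last minibatch may be smaller when batch_size does not divide m.
    return math.ceil(m / batch_size)


m = 1024  # hypothetical training set size
counts = {size: minibatches_per_epoch(m, size) for size in (1, 32, 256, m)}
for size, n in counts.items():
    print(f"minibatch size {size:>4} -> {n:>4} minibatches per epoch")
```

Note the two ends of the scale: a minibatch size equal to the training set size gives a single minibatch (full batch GD), while a minibatch size of 1 gives as many minibatches as there are training examples (Stochastic Gradient Descent).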


Great! I got this part; it added a lot to my understanding.
And I knew the answer. I was confused by the phrase “mini_batch size is the same as training size”: I was thinking that the number of mini-batches was the same as the number of training examples.
But I am lucky to have had this conversation with you, and thanks for the nice information you provided. :heart: