Week 2 Quiz question 2

abdalla_ebrahim · July 30, 2023, 3:51pm

Hi,
In the Week 2 quiz, there is the following question:

{moderator edit - quiz question and answers removed}

Why is my answer incorrect? And what is the correct answer?

paulinpaloalto · July 30, 2023, 4:33pm

It would be cheating for us to tell you the right answer. But the reason that your answer is wrong is that minibatch has more overhead than full batch: there is another “inner” loop over the minibatches and you get less benefit from the vectorization since it is applied to smaller objects. So each “epoch” is actually more expensive in terms of the total compute cost. But what you hope is that you’ll end up needing fewer total epochs in order to get good convergence, because the weights get updated after each minibatch.
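To make the "inner loop" point concrete, here is a minimal NumPy sketch. It is not from the course materials: the toy linear-regression data, the `run_epoch` helper, and the learning rate are all assumptions made up for illustration. It only shows that one epoch of minibatch GD performs many small vectorized updates, while one epoch of full batch GD performs a single large one.

```python
import numpy as np

def run_epoch(X, y, w, batch_size, lr=0.1):
    """One pass over the data (hypothetical helper).

    Returns the updated weights and the number of weight updates performed.
    """
    m = X.shape[0]
    updates = 0
    for start in range(0, m, batch_size):  # the "inner" loop over minibatches
        Xb = X[start:start + batch_size]
        yb = y[start:start + batch_size]
        # Vectorized gradient of the mean squared error, but only over this minibatch.
        grad = Xb.T @ (Xb @ w - yb) / Xb.shape[0]
        w = w - lr * grad
        updates += 1
    return w, updates

# Toy, noise-free regression problem (assumed data, for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(1024, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w

w0 = np.zeros(3)
w_full, n_full = run_epoch(X, y, w0, batch_size=1024)  # full batch
w_mini, n_mini = run_epoch(X, y, w0, batch_size=64)    # minibatch

print(n_full, n_mini)  # 1 16
print(np.linalg.norm(w_full - true_w), np.linalg.norm(w_mini - true_w))
```

On this toy problem the minibatch run ends its single epoch closer to `true_w`, because it has taken 16 update steps instead of 1. That is the "fewer total epochs" effect, bought with more loop iterations per epoch.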

paulinpaloalto · July 30, 2023, 4:40pm

It’s been a while since I listened to the lectures here, but I would bet that Prof Ng discussed exactly this point that I just made in the lectures. If what I said above didn’t “compute” for you, you might want to go back and scan the transcript of the relevant lectures and see what Prof Ng says on this point.

abdalla_ebrahim · July 30, 2023, 4:49pm

Thank you for your response; I’m close to grasping the idea.

paulinpaloalto · July 30, 2023, 5:08pm

Great! The point of minibatch GD is that you get faster convergence (fewer epochs), at the expense of higher compute costs per epoch. Think of the minibatch size as the “knob” that you can turn to modify the performance. At one end of the scale, you have full batch gradient descent and at the other end you have “Stochastic Gradient Descent” where the minibatch size = 1. So the smaller the minibatch size, the higher the compute cost, but the faster the convergence.

But if you go all the way to the limit of batch size 1, then you also have the maximal compute cost (no benefit at all from vectorization) and the maximum amount of statistical noise in the updates: they may bounce all over the place since each one only depends on the behavior for one sample and you get no “smoothing” at all from any averaging. So the goal is to find the “Goldilocks” point at which you get the fastest convergence at the minimum cost.

There have been some very large and careful studies of this across lots of different systems and the conclusion is that Yann LeCun had it right in his famous quote: “Friends don’t let friends use minibatch sizes greater than 32”. In almost all cases, the optimal size was somewhere between 1 and 32.
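If it helps to see the “knob” numerically, the sketch below turns it on a toy problem. Everything in it is an assumption for illustration (the synthetic data, the `grad` and `gradient_noise` helpers, the number of trials); it is not a benchmark. For each minibatch size it prints the number of updates per epoch and a rough measure of how far a random minibatch gradient lands from the full batch gradient.

```python
import math
import numpy as np

# Toy noisy regression data (assumed, for illustration only).
rng = np.random.default_rng(1)
m = 2048
X = rng.normal(size=(m, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.5, size=m)
w = np.zeros(3)  # evaluate all gradients at the same fixed point

def grad(idx):
    """Mean-squared-error gradient over the samples selected by idx."""
    Xb, yb = X[idx], y[idx]
    return Xb.T @ (Xb @ w - yb) / len(idx)

full_grad = grad(np.arange(m))

def gradient_noise(batch_size, trials=100):
    """Average distance between a random minibatch gradient and the full batch gradient."""
    dists = []
    for _ in range(trials):
        idx = rng.choice(m, size=batch_size, replace=False)
        dists.append(np.linalg.norm(grad(idx) - full_grad))
    return float(np.mean(dists))

for bs in (1, 32, m):
    updates_per_epoch = math.ceil(m / bs)
    print(bs, updates_per_epoch, gradient_noise(bs))
```

Size 1 gives the most updates per epoch and the noisiest gradients, the full batch gives one essentially noise-free update, and a size like 32 sits in between on both counts.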

abdalla_ebrahim · July 30, 2023, 7:05pm

Great! I got this part; it added a lot to my understanding.
And I knew the answer. I was confused by the phrase “mini_batch size is the same as training size”: I was thinking that the number of mini-batches was the same as the number of training examples.
I am lucky to have had this conversation with you, and thank you for the helpful information you provided.
