Week 2 Quiz question 2

Great! The point of minibatch GD is that you get faster convergence, at the expense of higher compute cost. Think of the minibatch size as the “knob” that you can turn to modify the performance. At one end of the scale you have full batch gradient descent, and at the other end you have “Stochastic Gradient Descent”, where the minibatch size = 1. So the smaller the minibatch size, the higher the compute cost, but the faster the convergence: you get more parameter updates per epoch, but less benefit from vectorization on each update.

But if you go all the way to the limit of batch size 1, you also get the maximal compute cost (no benefit at all from vectorization) and the maximum amount of statistical noise in the updates: they may bounce all over the place, since each update depends on only one sample and you get no “smoothing” at all from averaging. So the goal is to find the “Goldilocks” point at which you get the fastest convergence at the minimum cost.

Researchers have run some very large and careful studies of this across lots of different systems, and the conclusion is that Yann LeCun had it right in his famous quote: “Friends don’t let friends use minibatches larger than 32”. In almost all cases, the optimal size was somewhere between 1 and 32.
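To make the “knob” concrete, here is a minimal numpy sketch of the update loop (my own illustrative code, not course code; the function name `minibatch_gd` and the linear-regression/MSE loss are just placeholders). Setting `batch_size = m` gives full batch GD, `batch_size = 1` gives SGD, and anything in between is minibatch GD:

```python
import numpy as np

def minibatch_gd(X, y, lr=0.01, batch_size=32, epochs=10, seed=0):
    """Minibatch gradient descent on a linear model with MSE loss.

    batch_size is the "knob": batch_size == X.shape[0] is full batch GD,
    batch_size == 1 is Stochastic Gradient Descent.
    """
    rng = np.random.default_rng(seed)
    m, n = X.shape
    w, b = np.zeros(n), 0.0
    for _ in range(epochs):
        perm = rng.permutation(m)                 # reshuffle each epoch
        for start in range(0, m, batch_size):
            idx = perm[start:start + batch_size]  # indices for this minibatch
            Xb, yb = X[idx], y[idx]
            err = Xb @ w + b - yb                 # predictions minus targets
            w -= lr * (Xb.T @ err) / len(idx)     # gradient averaged over the minibatch
            b -= lr * err.mean()
    return w, b
```

The averaging over `len(idx)` samples is where the “smoothing” comes from: a larger minibatch averages out more of the per-sample noise in each update, at the price of fewer updates per pass through the data.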
