Confusion Regarding Week 2 Video - 'Understanding Mini-batch Gradient Descent'

During the lecture, while explaining ‘Choosing your mini-batch size’, Sir Andrew mentions that when the mini-batch size is equal to ‘m’ (the number of training examples), it is equivalent to batch gradient descent, and when the mini-batch size is equal to 1, it is equivalent to stochastic gradient descent. I am a bit confused by this, because the explanation that follows seems to imply the opposite: that a mini-batch size of ‘m’ should correspond to stochastic gradient descent, and a mini-batch size of 1 to gradient descent. Can you please check and clarify whether this is an error in the course, or whether I am not understanding it correctly?

Hi Zeeshan,

Welcome to the community. Your initial understanding is correct:

  • Mini-batch size = 1: equivalent to Stochastic Gradient Descent
  • Mini-batch size = m: equivalent to Batch Gradient Descent
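To make the two extremes concrete, here is a rough sketch in plain NumPy (my own illustration, not code from the course; the helper name `make_mini_batches` is made up):

```python
import numpy as np

def make_mini_batches(X, Y, mini_batch_size):
    """Split (X, Y) into mini-batches of `mini_batch_size` examples each.
    X has shape (n_features, m) and Y has shape (1, m), as in the videos."""
    m = X.shape[1]                      # number of training examples
    batches = []
    for start in range(0, m, mini_batch_size):
        end = min(start + mini_batch_size, m)
        batches.append((X[:, start:end], Y[:, start:end]))
    return batches

# Toy data: m = 8 examples, 3 features each
X = np.random.randn(3, 8)
Y = np.random.randn(1, 8)

print(len(make_mini_batches(X, Y, 1)))   # 8 mini-batches of 1 example -> SGD
print(len(make_mini_batches(X, Y, 8)))   # 1 mini-batch of all m examples -> batch GD
print(len(make_mini_batches(X, Y, 4)))   # 2 mini-batches -> ordinary mini-batch GD
```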

Can you elaborate on which part of the video led to the confusion?

I was having some trouble understanding what exactly the term ‘mini-batch size’ represents. I was confused about whether it refers to the number of training examples in one mini-batch or to the total number of mini-batches that will be formed. It’s all clear now: the term refers to the number of training examples in one mini-batch. Thank you! :smiley:

Hi Somesh.
I understand that when size = m we have batch gradient descent.
When size = 1, we take only one example to compute the gradient. Once done, we repeat with the next example, and so on. We are not choosing examples randomly: we simply take the next example in order.
So I am confused by the terminology ‘stochastic’.
If you can help. Regards

Hi Stephane,

That’s an excellent question, and unfortunately I don’t have a complete answer. Based on what I know, in mini-batch gradient descent most libraries (like tf.keras) will shuffle the data by default before selecting the mini-batches (you can explicitly turn this off). This is similar to choosing samples at random without replacement.
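As a sketch of what that shuffling amounts to (my own NumPy illustration, not code from the assignment), each epoch permutes the examples once and then walks through them in that order; in tf.keras, `model.fit` does this for you because its `shuffle` argument defaults to `True`:

```python
import numpy as np

def shuffled_mini_batches(X, Y, mini_batch_size, seed=0):
    """Shuffle the examples (columns) once per epoch, then split into mini-batches."""
    m = X.shape[1]
    rng = np.random.default_rng(seed)
    perm = rng.permutation(m)            # random order of the m examples
    X_shuf, Y_shuf = X[:, perm], Y[:, perm]
    return [(X_shuf[:, k:k + mini_batch_size], Y_shuf[:, k:k + mini_batch_size])
            for k in range(0, m, mini_batch_size)]

# With mini_batch_size = 1 this is the usual "stochastic" setup:
# each epoch still visits every example exactly once, but in a fresh random order.
```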

I know this isn’t very convincing and I apologize. Let me check with other mentors for a better response.

Thanks

I believe it is true that it is normal practice to randomly shuffle the order of the samples before each epoch when doing any flavor of minibatch GD (SGD or not). But the real point is that the stochastic behavior does not arise from randomly choosing the sample on a given iteration: it arises because the samples themselves come from a statistical distribution.

When the minibatch size is > 1, you average the gradients over the samples in each minibatch, so there is some statistical smoothing from that averaging. With batch size 1, the gradients can jump all over the place, because you get no smoothing from averaging across more than one sample; every sample is different. The point of minibatch gradient descent is that you update the weights after every minibatch, so in the case of SGD the updates to the weights are more stochastic because you lose the averaging effect.
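A tiny numerical experiment makes that smoothing effect visible. The toy model, data, and numbers below are just my own illustration, not anything from the lectures:

```python
import numpy as np

# Toy 1-D linear model y ~ w * x with squared-error loss.
rng = np.random.default_rng(0)
m = 10_000
x = rng.normal(size=m)
y = 3.0 * x + rng.normal(scale=0.5, size=m)   # "true" w = 3, plus noise
w = 0.0                                        # current (bad) parameter value

per_example_grads = 2 * x * (w * x - y)        # dL/dw for each example separately

# Gradient estimates with mini-batch size 1 vs. 64:
sgd_grads = per_example_grads                                  # size-1 "mini-batches"
mb_grads = per_example_grads[: (m // 64) * 64].reshape(-1, 64).mean(axis=1)

print("std of size-1 gradient estimates :", sgd_grads.std())
print("std of size-64 gradient estimates:", mb_grads.std())
# The size-64 averages are much less noisy (std shrinks roughly by sqrt(64)),
# which is the smoothing effect of averaging over a mini-batch; with size 1
# each update sees the full sample-to-sample variation.
```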


Thanks, Paul, for the informative reply.