When creating a post, please add:
video links:
1 → https://www.coursera.org/learn/deep-neural-network/lecture/qcogH/mini-batch-gradient-descent
2 → https://www.coursera.org/learn/deep-neural-network/lecture/lBXu8/understanding-mini-batch-gradient-descent
My understanding of Mini Batch:
Instead of processing all input items at once, we divide the total items into multiple mini-batches and apply forward and backward propagation to each batch in turn; one pass over all the batches is one epoch.
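To make that concrete, here is a minimal sketch of one epoch of mini-batch gradient descent on a toy linear model (the data, batch size, and learning rate below are illustrative assumptions, not from the lectures):

```python
import numpy as np

# Toy data: 1000 examples, 3 features (illustrative only, not from the course)
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=1000)

w = np.zeros(3)      # model weights
batch_size = 64      # mini-batch size
lr = 0.1             # learning rate

# One epoch: forward + backward + update, once per mini-batch
for start in range(0, len(X), batch_size):
    Xb, yb = X[start:start + batch_size], y[start:start + batch_size]
    pred = Xb @ w                         # forward pass
    grad = Xb.T @ (pred - yb) / len(Xb)   # backward pass (gradient of MSE)
    w -= lr * grad                        # weights update after *this* mini-batch
```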
My concern:
We may have different kinds of data. Let's take cat images as an example:
1 → we may have only cats in the image
2 → we may have cats along with other objects
3 → we may have cats with only 80% of the body visible in the image
4 → we may have cats in noisy images
5 → we may have cats whose widths and heights were adjusted to match our network input
and so on.
What I mean is that our images fall into different groups.
After the mini-batch division, let's assume we have one kind of data in the first batch and a completely different kind of data in the last batch.
Since we are processing by batches, the weights adjust to the images in the current batch instead of to all images at once.
At the end of an epoch we will get weights that are best suited to the last batch, and this happens every epoch.
Yes, we are processing all images, but the weights at the end of the epoch are adjusted more toward the last batch.
Please help me understand: isn't that a loss?
It’s a great question. It’s been several years since I watched the lectures here, but I’m pretty sure that Prof Ng addresses this point, although perhaps he discusses it in a bit more general way than you describe. The advantage of minibatch processing as opposed to full batch is that the learning is faster because the weights are adjusted after processing each minibatch. That does potentially introduce more noise into the process since (as you describe) you might have different minibatches that have images with different characteristics. There are a couple of ways that you can mitigate that issue:
- In the standard implementation of minibatch, you randomly shuffle the full dataset on each epoch before you select the minibatches, so that has the effect of smoothing out the statistical behavior. You won’t get a fixed pattern in the behavior of the minibatches from one epoch to the next.
- The size of minibatch that you use is a hyperparameter. Depending on the characteristics of your dataset, you may find that you get better results with larger minibatch sizes (64 or 128 or 256) as opposed to smaller ones (32 or 16 or even stochastic GD with batchsize = 1).
The general rule of thumb that most people use is expressed in the famous Yann LeCun quote: “Friends don’t let friends use minibatches larger than 32.”
But as always YMMV and experimentation is required. 
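For illustration, here is a minimal sketch of the shuffle-then-partition step described above (the helper name `minibatches` and the toy data are just assumptions for the example, not anything from the course code):

```python
import numpy as np

def minibatches(X, y, batch_size, rng):
    """Yield mini-batches from a freshly shuffled copy of the data."""
    perm = rng.permutation(len(X))          # new random order on every call (i.e. every epoch)
    X, y = X[perm], y[perm]
    for start in range(0, len(X), batch_size):
        yield X[start:start + batch_size], y[start:start + batch_size]

rng = np.random.default_rng(42)
X = np.arange(20, dtype=float).reshape(10, 2)   # 10 tiny examples, purely illustrative
y = np.arange(10, dtype=float)

for epoch in range(2):                          # the mini-batch composition differs per epoch
    for Xb, yb in minibatches(X, y, batch_size=4, rng=rng):
        pass                                    # forward / backward / update would go here
```

Because the shuffle happens before every epoch, a batch that happened to contain "one kind" of cat image in one epoch will be mixed differently in the next.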
But note that the learning is always cumulative, right? Sure the last minibatch has some effect, but so did all the previous ones. And as mentioned in my earlier post, the last minibatch is different on every epoch. Of course you also have to pick a reasonable learning rate, so that you’re not bouncing around too much in every step you take.
I agree with Paul. Even if we do batch gradient descent (giving it the whole training set each time), we would not expect the model to converge to a solution in one step, which means in the mini-batch setting, we would not expect the model to be completely adapted to each latest mini-batch, be it the last one or any one in the middle.
Instead, my picture would be that, during an epoch, as we give it different mini-batches, we effectively keep changing the cost landscape, which means we keep moving the minima around, because each mini-batch defines its own minima. I would imagine the result is that the model can never really settle into the minimum of any one mini-batch, but swings somewhere in between.
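To make that picture concrete, here is a tiny toy illustration (purely hypothetical numbers, not from the course): for 1-D least squares, each mini-batch has its own closed-form minimizer, and the two differ:

```python
import numpy as np

# Two mini-batches drawn from the same underlying relationship y ≈ 2x, but with
# different samples and noise. For 1-D least squares the batch-optimal weight is
# w* = sum(x*y) / sum(x*x), i.e. each mini-batch defines its own minimum.
rng = np.random.default_rng(1)
x1, x2 = rng.normal(size=32), rng.normal(size=32)
y1 = 2.0 * x1 + 0.5 * rng.normal(size=32)
y2 = 2.0 * x2 + 0.5 * rng.normal(size=32)

w_batch1 = np.sum(x1 * y1) / np.sum(x1 * x1)   # minimum of batch 1's cost
w_batch2 = np.sum(x2 * y2) / np.sum(x2 * x2)   # minimum of batch 2's cost
print(w_batch1, w_batch2)   # both near 2.0 but not equal: the minima move between batches
```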
Cheers,
Raymond
One other thing worth mentioning is that this is one technique that is basically universally used. You may have to tweak the batch size and the learning rate and whether you use Adam or Momentum, but everybody does minibatch.
Thanks for the reply. I understand that shuffling helps us to create good mini batches.
I’m new to deep learning, but I have good exposure to software. In software we usually create failsafes for these kinds of scenarios.
Here is my idea:
Can we do multiple shufflings at different epochs (to my knowledge shuffling is not a very costly operation),
draw some metrics on how the cost changes under the different shufflings, and maybe create some parameters that tell us how good our learning is, so that we can update the hyperparameters instead of retraining on the entire dataset one more time? I understand that the effect of multiple shufflings is the same as a single shuffling, but such parameters could be helpful.
Maybe the above thought is not useful, or it is covered in later courses.
For today I take it that you have worked on datasets from different domains in your experience, and I will mark the reply as the solution.
Yes, that’s exactly the point: before every epoch, we do a new random shuffle. So on every epoch (one full pass through the complete training set) we have different minibatches. And yes you’re right that the shuffling is a very inexpensive operation.
I think I’m probably just missing your point here, but note that of course the cost changes continuously at every minibatch because we are doing Gradient Descent here. We update the weights after each minibatch.
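As a rough sketch of what that looks like (the toy data and hyperparameters below are assumptions for illustration only), you can record the cost after every mini-batch update and watch it bounce around from batch to batch while trending downward across epochs:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 2))
y = X @ np.array([3.0, -1.0]) + 0.1 * rng.normal(size=256)

w, lr, batch_size = np.zeros(2), 0.1, 32
costs = []                                   # one cost value per weight update

for epoch in range(3):
    perm = rng.permutation(len(X))           # reshuffle before every epoch
    Xs, ys = X[perm], y[perm]
    for start in range(0, len(Xs), batch_size):
        Xb, yb = Xs[start:start + batch_size], ys[start:start + batch_size]
        err = Xb @ w - yb
        costs.append(float(np.mean(err ** 2)))    # cost on this mini-batch, before the step
        w -= lr * (Xb.T @ err) / len(Xb)          # update the weights after each mini-batch

print(costs[:3], costs[-3:])                 # noisy from batch to batch, but trending down
```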
One of my points in the earlier replies is that minibatch Gradient Descent is very widely used. So if your question is “are we really sure this is a workable technique”, the answer is yes and please realize that this is not some statement that I am making based on my personal experiences: we see it all throughout the courses here from this point onwards.