Why do we process all minibatches in an epoch?

In mini-batch gradient descent we process the first mini-batch, then the second, and so on, in every epoch.

Say there are 1000 epochs. Why can’t we run the first mini-batch for 1000 epochs, then the second mini-batch for 1000 epochs, and so on?

This would reduce the need to swap data in and out of GPU memory. The algorithm would learn somewhat like transfer learning, where each mini-batch represents a slightly different distribution, and subsequent mini-batches build on top of one another.
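
In loop form, the comparison would be roughly the sketch below (a minimal NumPy illustration; the linear-regression data and the `sgd_step` helper are made up for this example, not from the course):

```python
# Sketch contrasting the two loop orderings on a tiny NumPy linear-regression task.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1024, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=1024)

def sgd_step(w, Xb, yb, lr=0.01):
    grad = 2 * Xb.T @ (Xb @ w - yb) / len(yb)   # gradient of MSE on this mini-batch
    return w - lr * grad

batch_size, epochs = 32, 1000
batches = [(X[i:i + batch_size], y[i:i + batch_size])
           for i in range(0, len(X), batch_size)]

# Standard mini-batch gradient descent: every epoch visits every mini-batch.
w = np.zeros(3)
for epoch in range(epochs):
    for Xb, yb in batches:
        w = sgd_step(w, Xb, yb)

# The ordering proposed in the question: exhaust one mini-batch before moving on.
w_alt = np.zeros(3)
for Xb, yb in batches:
    for epoch in range(epochs):
        w_alt = sgd_step(w_alt, Xb, yb)
```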

Ref: coursera.org/learn/deep-neural-network/lecture/qcogH/mini-batch-gradient-descent

  

Also, why is the decrease in the cost function not smooth?

Is it because the individual mini-batches are from slightly different distributions (or maybe from different parts of the same distribution), and hence have a slightly different notion of where the global optimum is?
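
One way to see the mini-batch-to-mini-batch variation is to hold the weights fixed and evaluate the loss on each mini-batch separately (an illustrative NumPy sketch; the data and weights here are invented):

```python
# Each point on the training curve is the loss on a *different* mini-batch,
# so even with fixed weights the values vary from batch to batch.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(1024, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.5 * rng.normal(size=1024)
w = np.array([0.9, -1.8, 0.4])            # some fixed, partially trained weights

batch_size = 32
per_batch_loss = [
    np.mean((X[i:i + batch_size] @ w - y[i:i + batch_size]) ** 2)
    for i in range(0, len(X), batch_size)
]
print(f"min {min(per_batch_loss):.3f}  max {max(per_batch_loss):.3f}")
# The spread between min and max is the jitter you see in the cost plot.
```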

You need to go through all of your data several times to learn from it. If you use only one minibatch per epoch, you might go through your data only once (or maybe a few times), but the model needs to go through it many times to learn from it effectively!


So eventually the model does go through all of the data.
Steps:

  1. Take mini-batch 1 - train for 1000 epochs
  2. Take mini-batch 2 - train for 1000 epochs

and so on.

So when the algorithm finishes, the network has seen all of the data.

The effect of doing it that way would be much different, and I would bet it would not work very well. The order matters. The point is that the minibatches are much smaller than the complete data set, e.g. typically 32 entries or fewer. That means they may not be statistically representative of the complete data set, so training repetitively on just one minibatch greatly limits what the training can learn. The point of cycling through all the minibatches in every epoch is that the model learns from a statistically complete representation of the data on each epoch.
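
A quick illustration of how unrepresentative a 32-sample minibatch can be (an illustrative NumPy sketch, not course code):

```python
# The mean of a 32-sample minibatch can drift noticeably from the full-data mean.
import numpy as np

rng = np.random.default_rng(2)
data = rng.normal(loc=5.0, scale=2.0, size=10_000)

batch_means = rng.choice(data, size=(100, 32)).mean(axis=1)
print(f"full-data mean: {data.mean():.3f}")
print(f"32-sample minibatch means range from {batch_means.min():.3f} "
      f"to {batch_means.max():.3f}")
```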

One other subtlety that is worth pointing out is that the standard practice is to randomly reshuffle the data on each epoch, so that the minibatches will not be the same in each epoch. But the model will still see all the data in every epoch. That gives better statistical behavior and makes it more likely that minibatch gradient descent will be able to reach good performance in fewer total epochs than if we did Full Batch Gradient Descent.
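
In code, the reshuffle-per-epoch pattern looks roughly like the sketch below (an illustrative NumPy version; the `minibatches` helper and variable names are assumptions, not the course's implementation):

```python
# The data set is identical each epoch, but the minibatches differ because the
# data is re-permuted before slicing.
import numpy as np

def minibatches(X, y, batch_size, rng):
    perm = rng.permutation(len(X))              # new random order every call
    for i in range(0, len(X), batch_size):
        idx = perm[i:i + batch_size]
        yield X[idx], y[idx]

rng = np.random.default_rng(3)
X, y = rng.normal(size=(1024, 3)), rng.normal(size=1024)

for epoch in range(5):
    for Xb, yb in minibatches(X, y, batch_size=32, rng=rng):
        pass                                     # one gradient step per minibatch
```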


Tiny detail:

We don’t have to process all mini-batches of the underlying data in every epoch (please see this topic).

See this link for more on the steps_per_epoch parameter.
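
For example, with Keras and a repeating tf.data pipeline, steps_per_epoch controls how many mini-batches are counted as one epoch (a hedged sketch; the model and data here are placeholders, not from the linked topic):

```python
# With a repeating dataset, an "epoch" is just however many batches fit() counts.
import numpy as np
import tensorflow as tf

X = np.random.normal(size=(1024, 3)).astype("float32")
y = np.random.normal(size=(1024, 1)).astype("float32")

ds = tf.data.Dataset.from_tensor_slices((X, y)).shuffle(1024).batch(32).repeat()

model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
model.compile(optimizer="sgd", loss="mse")

# Only 10 mini-batches are counted per "epoch" here, not the full 32 batches,
# so one pass over all the data is spread across several reported epochs.
model.fit(ds, epochs=5, steps_per_epoch=10)
```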
