Why do we process all minibatches in an epoch?

In mini-batch gradient descent we process the first mini-batch, then the second, and so on, in every epoch.

Say there are 1000 epochs. Why can’t we run the first mini-batch for 1000 epochs, then the second mini-batch for 1000 epochs, and so on?

This would reduce the need to swap data in and out of GPU memory. The algorithm would learn somewhat like transfer learning, where each mini-batch represents a slightly different distribution, and subsequent mini-batches build on top of one another.
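
In loop form, the comparison would be roughly the sketch below (a minimal NumPy illustration; the linear-regression data and the `sgd_step` helper are made up for this example, not from the course):

```python
# Sketch contrasting the two loop orderings on a tiny NumPy linear-regression task.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1024, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=1024)

def sgd_step(w, Xb, yb, lr=0.01):
    grad = 2 * Xb.T @ (Xb @ w - yb) / len(yb)   # gradient of MSE on this mini-batch
    return w - lr * grad

batch_size, epochs = 32, 1000
batches = [(X[i:i + batch_size], y[i:i + batch_size])
           for i in range(0, len(X), batch_size)]

# Standard mini-batch gradient descent: every epoch visits every mini-batch.
w = np.zeros(3)
for epoch in range(epochs):
    for Xb, yb in batches:
        w = sgd_step(w, Xb, yb)

# The ordering proposed in the question: exhaust one mini-batch before moving on.
w_alt = np.zeros(3)
for Xb, yb in batches:
    for epoch in range(epochs):
        w_alt = sgd_step(w_alt, Xb, yb)
```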

Ref: coursera.org/learn/deep-neural-network/lecture/qcogH/mini-batch-gradient-descent

  

Also, why is the decrease in the cost function not smooth?

Is it because the individual mini-batches are from slightly different distributions (or maybe from different parts of the same distribution), and hence have a slightly different notion of where the global optimum is?
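
One way to see the mini-batch-to-mini-batch variation is to hold the weights fixed and evaluate the loss on each mini-batch separately (an illustrative NumPy sketch; the data and weights here are invented):

```python
# Each point on the training curve is the loss on a *different* mini-batch,
# so even with fixed weights the values vary from batch to batch.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(1024, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.5 * rng.normal(size=1024)
w = np.array([0.9, -1.8, 0.4])            # some fixed, partially trained weights

batch_size = 32
per_batch_loss = [
    np.mean((X[i:i + batch_size] @ w - y[i:i + batch_size]) ** 2)
    for i in range(0, len(X), batch_size)
]
print(f"min {min(per_batch_loss):.3f}  max {max(per_batch_loss):.3f}")
# The spread between min and max is the jitter you see in the cost plot.
```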

You need to go through all of your data several times to learn from it. If you use only one minibatch per epoch, you might go through your data only once (or maybe a few times), but the model needs to go through it many times to learn from it effectively!


So eventually the model does go through all of the data.
Steps:

  1. Take mini-batch 1 - train for 1000 epochs
  2. Take mini-batch 2 - train for 1000 epochs

and so on.

So when the algorithm finishes, the network has seen all of the data.

The effect of doing it that way would be much different, and I would bet it would not work very well. The order matters. The point is that the minibatches are much smaller than the complete data set, e.g. typically 32 entries or fewer. That means they may not be statistically representative of the complete data set, so training repetitively on just one minibatch greatly limits what the training can learn. The point of cycling through all the minibatches in every epoch is that the model learns from a statistically complete representation of the data on each epoch.
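
A quick illustration of how unrepresentative a 32-sample minibatch can be (an illustrative NumPy sketch, not course code):

```python
# The mean of a 32-sample minibatch can drift noticeably from the full-data mean.
import numpy as np

rng = np.random.default_rng(2)
data = rng.normal(loc=5.0, scale=2.0, size=10_000)

batch_means = rng.choice(data, size=(100, 32)).mean(axis=1)
print(f"full-data mean: {data.mean():.3f}")
print(f"32-sample minibatch means range from {batch_means.min():.3f} "
      f"to {batch_means.max():.3f}")
```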

One other subtlety that is worth pointing out is that the standard practice is to randomly reshuffle the data on each epoch, so that the minibatches will not be the same in each epoch. But the model will still see all the data in every epoch. That gives better statistical behavior and makes it more likely that minibatch gradient descent will be able to reach good performance in fewer total epochs than if we did Full Batch Gradient Descent.
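
In code, the reshuffle-per-epoch pattern looks roughly like the sketch below (an illustrative NumPy version; the `minibatches` helper and variable names are assumptions, not the course's implementation):

```python
# The data set is identical each epoch, but the minibatches differ because the
# data is re-permuted before slicing.
import numpy as np

def minibatches(X, y, batch_size, rng):
    perm = rng.permutation(len(X))              # new random order every call
    for i in range(0, len(X), batch_size):
        idx = perm[i:i + batch_size]
        yield X[idx], y[idx]

rng = np.random.default_rng(3)
X, y = rng.normal(size=(1024, 3)), rng.normal(size=1024)

for epoch in range(5):
    for Xb, yb in minibatches(X, y, batch_size=32, rng=rng):
        pass                                     # one gradient step per minibatch
```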


Tiny detail:

We don’t have to process all mini-batches of the underlying data in every epoch (please see this topic).

See this link for more on the steps_per_epoch parameter.
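
For example, with Keras and a repeating tf.data pipeline, steps_per_epoch controls how many mini-batches are counted as one epoch (a hedged sketch; the model and data here are placeholders, not from the linked topic):

```python
# With a repeating dataset, an "epoch" is just however many batches fit() counts.
import numpy as np
import tensorflow as tf

X = np.random.normal(size=(1024, 3)).astype("float32")
y = np.random.normal(size=(1024, 1)).astype("float32")

ds = tf.data.Dataset.from_tensor_slices((X, y)).shuffle(1024).batch(32).repeat()

model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
model.compile(optimizer="sgd", loss="mse")

# Only 10 mini-batches are counted per "epoch" here, not the full 32 batches,
# so one pass over all the data is spread across several reported epochs.
model.fit(ds, epochs=5, steps_per_epoch=10)
```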
