I want to clarify my understanding of the point behind the number of epochs. So each epoch means the model has seen all the training data during training. During each epoch, gradient descent runs and finds the most appropriate coefficients that lead to the lowest loss. So what changes from one epoch run to the next?

Hello @Basira_Daqiq,

We go through the whole training set once in each epoch. Within an epoch, the training set is divided into a preconfigured number of mini-batches, and one gradient descent step is performed per mini-batch. Each step changes the weights TOWARDS an optimal solution, but no single step is guaranteed to reach the optimum. We need to distinguish between "changing towards" and "reaching".

The problem here is that a gradient descent step does NOT reach the lowest loss. It only changes the weights TOWARDS the lowest loss. Therefore, one step doesn't guarantee it, and one epoch doesn't guarantee it either. If one epoch is not sufficient, we need another epoch. That's why we want more than one epoch.
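A small sketch may make this concrete. This is not the course's code, just a toy linear-regression example (the data, learning rate, and batch size are all made up) showing that each epoch's mini-batch steps move the weight towards the optimum without reaching it, so the loss keeps dropping across epochs:

```python
import numpy as np

# Toy data: y = 3x + noise. We fit a single weight w with
# mini-batch gradient descent and record the loss after each epoch.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=64)
y = 3.0 * x + rng.normal(0, 0.1, size=64)

w = 0.0          # initial weight
lr = 0.1         # learning rate (hypothetical choice)
batch_size = 16  # 64 / 16 = 4 mini-batches, i.e. 4 gradient steps per epoch

def mse(w):
    return float(np.mean((w * x - y) ** 2))

epoch_losses = []
for epoch in range(5):
    perm = rng.permutation(64)  # reshuffle the training set each epoch
    for start in range(0, 64, batch_size):
        idx = perm[start:start + batch_size]
        xb, yb = x[idx], y[idx]
        grad = np.mean(2 * (w * xb - yb) * xb)  # d(batch MSE)/dw
        w -= lr * grad  # one step TOWARDS the optimum, not a jump to it
    epoch_losses.append(mse(w))

print(epoch_losses)  # the loss after epoch 1 is far from the minimum;
                     # later epochs keep improving it
```

Running it, you would see the loss after the first epoch is still well above the minimum, and each additional epoch brings it down further, which is exactly why training uses more than one epoch.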

Cheers,

Raymond

thank you!