What is the meaning of "reduce the number of training epochs"?

I saw this in a lab; what does it mean?

3 Likes

Hello @charindith

An “epoch” means one pass over the whole training set. If my training set has 10 samples, then running 20 epochs means my model is trained on those 10 samples 20 times. However, if I copy each sample 4 times so that my training set grows to 40 samples (4 copies of each of the 10 samples, 4 × 10 = 40), then I only need to run 5 epochs for my model to see the 10 samples the same 20 times. Therefore, making the copies can reduce the number of epochs.
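As a minimal sketch (with made-up numbers, not taken from the lab), the copying could look like this; note that the total number of times each original sample is seen stays the same:

```python
import numpy as np

# Hypothetical tiny training set: 10 samples with 1 feature each (made-up numbers).
X = np.arange(10).reshape(10, 1)

# Option A: train on the original 10 samples for 20 epochs.
passes_a = 20 * len(X)              # 200 sample-passes in total

# Option B: copy every sample 4 times (10 -> 40 samples) and train for 5 epochs.
X_copied = np.tile(X, (4, 1))       # shape (40, 1)
passes_b = 5 * len(X_copied)        # also 200 sample-passes

print(len(X_copied))                # 40
print(passes_a == passes_b)         # True: each original sample is still seen 20 times
```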

One reason we prefer 5 epochs of 40 samples over 20 epochs of 10 samples is that the former saves some time. There is some overhead in switching from one epoch to the next, so having fewer epochs saves time.

One reason we prefer 5 epochs of 40 samples over 1 epoch of 200 samples (20 copies of each of the 10 samples) is that we don’t want to get rid of that overhead completely. The overhead includes tracking metric and/or loss performance from one epoch to the next, which lets us design an algorithm that stops the training early when some performance criterion is met.
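For example, assuming a Keras/TensorFlow setup (not necessarily what the lab uses), such an early-stopping check can be expressed as a callback that runs in exactly that between-epoch overhead:

```python
import tensorflow as tf

# Stop training when the validation loss has not improved for 3 consecutive
# epochs. The check runs at the end of each epoch, i.e. in the "overhead"
# between one epoch and the next.
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss",
    patience=3,
    restore_best_weights=True,
)
```

Passing this callback to model.fit() makes the framework evaluate the stopping condition once per epoch, which is only possible because the epoch boundaries still exist.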

Cheers,
Raymond

PS1: The number of gradient-descent steps performed in one epoch depends on the mini-batch size. After copying the data, if the total number of samples is 40 and the mini-batch size is 2, then there will be 20 gradient-descent steps in one epoch.
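A quick sketch of that arithmetic, using the made-up numbers above:

```python
import math

n_samples = 40   # total samples after copying
batch_size = 2   # mini-batch size

# One gradient-descent step (one parameter update) per mini-batch.
steps_per_epoch = math.ceil(n_samples / batch_size)
print(steps_per_epoch)   # 20 gradient-descent steps in one epoch
```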

7 Likes

Hi @rmwkwok,

I didn’t quite understand this statement. Correct me if my understanding is wrong.

If we are to use batch gradient descent, where the entire dataset is used to compute the gradient in each epoch, then the number of gradients calculated would be equal to the number of epochs. In this case, with 5 epochs, we would be calculating 5 gradients in total.

However, if we were to use a different optimization algorithm such as stochastic gradient descent or mini-batch gradient descent, where only a subset of the data (batch) is used to compute the gradient at each step, then the number of gradients would be different.

For example, if you are using mini-batch gradient descent with a batch size of 10 (i.e., updating the weights after processing 10 samples at a time), you would calculate 4 gradients per epoch because 40 samples divided by a batch size of 10 equals 4 batches. Therefore, over 5 epochs, you would calculate 20 gradients in total.

Hello @bhavanamalla,

Everything you said is correct. However, you said you didn’t quite understand a statement.

Which statement is it? Is it my PS1? If so, it does not contradict what you have said: even though we theoretically have different names (batch GD, mini-batch GD, and stochastic GD) for different numbers of samples used in one gradient-descent step, in practice this is controlled by one hyperparameter called the batch size.

What I am saying is that, in TensorFlow for example, we don’t configure the model to use batch GD, mini-batch GD, or stochastic GD; instead, we configure the batch size.
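For example, here is a minimal, self-contained Keras sketch (made-up data and a toy model, purely for illustration); only the `batch_size` argument changes between the three flavours of gradient descent:

```python
import numpy as np
import tensorflow as tf

# Made-up data and a toy model, only to show the batch_size knob.
X = np.random.rand(40, 3).astype("float32")
y = np.random.rand(40, 1).astype("float32")

model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
model.compile(optimizer="sgd", loss="mse")

# The same fit() call covers all three "flavours" of gradient descent:
model.fit(X, y, epochs=5, batch_size=len(X), verbose=0)  # batch GD: 1 update per epoch
model.fit(X, y, epochs=5, batch_size=10, verbose=0)      # mini-batch GD: 4 updates per epoch
model.fit(X, y, epochs=5, batch_size=1, verbose=0)       # stochastic GD: 40 updates per epoch
```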

Cheers,
Raymond

Thanks for the clarification

Hello Raymond!
Thank you for your generous answer. Could you please tell me what an overhead is?

Here is the wiki definition for the term overhead. For example, in switching from one epoch to the next, a TensorFlow training process may evaluate the model on a so-called “evaluation dataset”, and it may also run a so-called “early-stopping” check to see whether the training should stop. These checks have no effect on the trainable parameters of the neural network, but are carried out, at the user’s (my) choice, in between two epochs.
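To make that concrete, here is a minimal Keras sketch (made-up data and a toy model, just to show where the between-epoch work happens):

```python
import numpy as np
import tensorflow as tf

# Made-up training and evaluation data, plus a toy model.
X_train = np.random.rand(40, 3).astype("float32")
y_train = np.random.rand(40, 1).astype("float32")
X_val = np.random.rand(10, 3).astype("float32")
y_val = np.random.rand(10, 1).astype("float32")

model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
model.compile(optimizer="sgd", loss="mse")

# At the end of every epoch, Keras evaluates the model on (X_val, y_val) and
# the EarlyStopping callback decides whether to stop. Neither step changes the
# trainable parameters; both are part of the "overhead" between two epochs.
model.fit(
    X_train, y_train,
    epochs=50,
    batch_size=10,
    validation_data=(X_val, y_val),
    callbacks=[tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=3)],
    verbose=0,
)
```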

1 Like

@rmwkwok Why do we need to train on the same samples 20 times?

@Nithin_Kumar_A, what do you think?

If we are changing the weights (w, b) every time, then the chance of convergence is higher, which improves model accuracy.

1 Like

That is a good observation! Gradient descent is an algorithm that “walks step by step” toward an optimal solution.

Before we train, we don’t know how many steps are needed, and we cannot guarantee that training on the dataset once will give us enough steps to reach the solution. Instead, it is usual to have to train on the dataset more than once. The number “20” was an arbitrary number in my example. In practice, we need to monitor the model’s latest performance as gradient descent progresses. It might end up requiring only 10 epochs, or it might end up requiring 30 epochs, depending on factors such as the learning rate, the model architecture, and the dataset itself.

Again, it is important not to treat the number “20” as a rule or a standard. Instead, for every problem you come across, you need to figure out the number of training epochs required by monitoring the model’s latest performance.
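As an illustration of that monitoring, assuming a Keras setup with made-up data, the `History` object returned by `fit()` records one metric value per epoch, and inspecting (or plotting) those values is how you judge how many epochs your problem actually needs:

```python
import numpy as np
import tensorflow as tf

# Made-up data and a toy model, only to show per-epoch monitoring.
X = np.random.rand(40, 3).astype("float32")
y = np.random.rand(40, 1).astype("float32")

model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
model.compile(optimizer="sgd", loss="mse")

history = model.fit(X, y, epochs=30, batch_size=10, validation_split=0.2, verbose=0)
print(history.history["loss"])      # training loss, one value per epoch
print(history.history["val_loss"])  # validation loss, one value per epoch
```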

2 Likes

Thank you very much!

1 Like