What is the meaning of "reduce the number of training epochs"?

I saw this in a lab; what does it mean?

3 Likes

Hello @charindith

An “epoch” means one pass over the whole training set. If my training set has 10 samples, then running 20 epochs means my model is trained on those 10 samples 20 times. However, if I copy each sample 4 times so that my training set grows to 40 samples (4 copies of each of the 10 samples, 4 × 10 = 40), then I only need to run 5 epochs for my model to see the 10 samples the same 20 times. Therefore, making the copies can reduce the number of epochs.
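As a minimal sketch (with made-up numbers, not taken from the lab), the copying could look like this; note that the total number of times each original sample is seen stays the same:

```python
import numpy as np

# Hypothetical tiny training set: 10 samples with 1 feature each (made-up numbers).
X = np.arange(10).reshape(10, 1)

# Option A: train on the original 10 samples for 20 epochs.
passes_a = 20 * len(X)              # 200 sample-passes in total

# Option B: copy every sample 4 times (10 -> 40 samples) and train for 5 epochs.
X_copied = np.tile(X, (4, 1))       # shape (40, 1)
passes_b = 5 * len(X_copied)        # also 200 sample-passes

print(len(X_copied))                # 40
print(passes_a == passes_b)         # True: each original sample is still seen 20 times
```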

One reason we prefer 5 epochs of 40 samples over 20 epochs of 10 samples is that the former saves some time. There is some overhead in switching from one epoch to the next, so having fewer epochs saves time.

One reason we prefer 5 epochs of 40 samples over 1 epoch of 200 samples (20 copies of each of the 10 samples) is that we don’t want to get rid of that overhead completely. The overhead includes tracking metric and/or loss performance from one epoch to the next, which lets us design an algorithm that stops the training early when some performance criterion is met.
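For example, assuming a Keras/TensorFlow setup (not necessarily what the lab uses), such an early-stopping check can be expressed as a callback that runs in exactly that between-epoch overhead:

```python
import tensorflow as tf

# Stop training when the validation loss has not improved for 3 consecutive
# epochs. The check runs at the end of each epoch, i.e. in the "overhead"
# between one epoch and the next.
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss",
    patience=3,
    restore_best_weights=True,
)
```

Passing this callback to model.fit() makes the framework evaluate the stopping condition once per epoch, which is only possible because the epoch boundaries still exist.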

Cheers,
Raymond

PS1: The number of gradient-descent steps performed in one epoch depends on the mini-batch size. After copying the data, if the total number of samples is 40 and the mini-batch size is 2, then there will be 20 gradient-descent steps in one epoch.
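A quick sketch of that arithmetic, using the made-up numbers above:

```python
import math

n_samples = 40   # total samples after copying
batch_size = 2   # mini-batch size

# One gradient-descent step (one parameter update) per mini-batch.
steps_per_epoch = math.ceil(n_samples / batch_size)
print(steps_per_epoch)   # 20 gradient-descent steps in one epoch
```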

7 Likes

Hi @rmwkwok,

I didn’t quite understand this statement. Correct me if my understanding is wrong.

If we are to use batch gradient descent, where the entire dataset is used to compute the gradient in each epoch, then the number of gradients calculated would be equal to the number of epochs. In this case, with 5 epochs, we would be calculating 5 gradients in total.

However, if we were to use a different optimization algorithm such as stochastic gradient descent or mini-batch gradient descent, where only a subset of the data (batch) is used to compute the gradient at each step, then the number of gradients would be different.

For example, if you are using mini-batch gradient descent with a batch size of 10 (i.e., updating the weights after processing 10 samples at a time), you would calculate 4 gradients per epoch because 40 samples divided by a batch size of 10 equals 4 batches. Therefore, over 5 epochs, you would calculate 20 gradients in total.

Hello @bhavanamalla,

Everything you said is correct. However, you said you didn’t quite understand a statement.

Which statement is it? Is it my PS1? If so, it does not contradict what you have said: even though we theoretically have different names (batch GD, mini-batch GD, and stochastic GD) for different numbers of samples used in one gradient-descent step, in practice this is controlled by one hyperparameter called the batch size.

What I am saying is that, in TensorFlow for example, we don’t configure the model to use batch GD, mini-batch GD, or stochastic GD; instead, we configure the batch size.
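For example, here is a minimal, self-contained Keras sketch (made-up data and a toy model, purely for illustration); only the `batch_size` argument changes between the three flavours of gradient descent:

```python
import numpy as np
import tensorflow as tf

# Made-up data and a toy model, only to show the batch_size knob.
X = np.random.rand(40, 3).astype("float32")
y = np.random.rand(40, 1).astype("float32")

model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
model.compile(optimizer="sgd", loss="mse")

# The same fit() call covers all three "flavours" of gradient descent:
model.fit(X, y, epochs=5, batch_size=len(X), verbose=0)  # batch GD: 1 update per epoch
model.fit(X, y, epochs=5, batch_size=10, verbose=0)      # mini-batch GD: 4 updates per epoch
model.fit(X, y, epochs=5, batch_size=1, verbose=0)       # stochastic GD: 40 updates per epoch
```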

Cheers,
Raymond

Thanks for the clarification

Hello Raymond!
Thank you for your generous answer. Could you please tell me what an overhead is?

Here is the wiki definition for the term overhead. For example, in switching from one epoch to the next, a TensorFlow training process may evaluate the model on a so-called “evaluation dataset”, and it may also run a so-called “early-stopping” check to see whether the training should stop. These checks have no effect on the trainable parameters of the neural network, but are carried out, at the user’s (my) choice, in between two epochs.
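To make that concrete, here is a minimal Keras sketch (made-up data and a toy model, just to show where the between-epoch work happens):

```python
import numpy as np
import tensorflow as tf

# Made-up training and evaluation data, plus a toy model.
X_train = np.random.rand(40, 3).astype("float32")
y_train = np.random.rand(40, 1).astype("float32")
X_val = np.random.rand(10, 3).astype("float32")
y_val = np.random.rand(10, 1).astype("float32")

model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
model.compile(optimizer="sgd", loss="mse")

# At the end of every epoch, Keras evaluates the model on (X_val, y_val) and
# the EarlyStopping callback decides whether to stop. Neither step changes the
# trainable parameters; both are part of the "overhead" between two epochs.
model.fit(
    X_train, y_train,
    epochs=50,
    batch_size=10,
    validation_data=(X_val, y_val),
    callbacks=[tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=3)],
    verbose=0,
)
```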

1 Like

@rmwkwok Why do we need to train on the same samples 20 times?

@Nithin_Kumar_A, what do you think?

If we are changing the weights (w, b) every time, then the chance of convergence is higher, which improves model accuracy.

1 Like

That is a good observation! Gradient descent is an algorithm that “walks step by step” toward an optimal solution.

Before we train, we don’t know how many steps are needed, and we cannot guarantee that training on the dataset once will give us enough steps to reach the solution. Instead, it is usual to have to train on the dataset more than once. The number “20” was an arbitrary number in my example. In practice, we need to monitor the model’s latest performance as gradient descent progresses. It might end up requiring only 10 epochs, or it might end up requiring 30 epochs, depending on factors such as the learning rate, the model architecture, and the dataset itself.

Again, it is important not to treat the number “20” as a rule or a standard. Instead, for every problem you come across, you need to figure out the number of training epochs required by monitoring the model’s latest performance.
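As an illustration of that monitoring, assuming a Keras setup with made-up data, the `History` object returned by `fit()` records one metric value per epoch, and inspecting (or plotting) those values is how you judge how many epochs your problem actually needs:

```python
import numpy as np
import tensorflow as tf

# Made-up data and a toy model, only to show per-epoch monitoring.
X = np.random.rand(40, 3).astype("float32")
y = np.random.rand(40, 1).astype("float32")

model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
model.compile(optimizer="sgd", loss="mse")

history = model.fit(X, y, epochs=30, batch_size=10, validation_split=0.2, verbose=0)
print(history.history["loss"])      # training loss, one value per epoch
print(history.history["val_loss"])  # validation loss, one value per epoch
```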

2 Likes

Thank you very much!

1 Like