In one of the labs on gradient descent, a value of 10,000 was chosen as the number of iterations. Is there a method to gauge what number to choose for this parameter? And how do we deem that convergence has occurred? Do we do it visually using graphs or is there a metric that we can use, like slope of the line tangent to the (cost vs iteration) curve?
Hello @stesoye
The number of iterations or epochs is a hyperparameter, and there isn’t a single correct value that works in all cases. It does involve a trial and error process.
One important aspect to keep in mind when setting an arbitrarily high value for the number of iterations is overfitting: an overfit model is less capable of making accurate predictions on new (unseen) data. So it should not be assumed that an extremely high number of iterations (even if hardware cost and training time are not an issue) will always be better; it could actually have a detrimental effect on the model’s accuracy on new data.
In addition to the training set, we can use a validation set and track the error on it. If the validation error begins to increase as training goes through more iterations, the model is probably starting to overfit and it would be better to stop training at that point. That point can then serve as a reference for the optimal number of iterations/epochs for this particular dataset.
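As a rough illustration of the validation-set idea (a minimal sketch with made-up toy data and scikit-learn, not the course lab code), you can hold out part of the data and compare the two errors:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Toy data just for illustration: y is a noisy linear function of x.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 3.0 * X[:, 0] + 5.0 + rng.normal(0, 2.0, size=200)

# Hold out 20% of the data as a validation set.
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression().fit(X_train, y_train)

# A validation error much higher than the training error (or one that starts
# rising as training continues) is a warning sign of overfitting.
train_mse = mean_squared_error(y_train, model.predict(X_train))
val_mse = mean_squared_error(y_val, model.predict(X_val))
print(f"train MSE: {train_mse:.3f}  validation MSE: {val_mse:.3f}")
```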
Hey @stesoye , just to add a few points based on @shanup’s reply:
Yes, we can. Sometimes people plot the “training curve”, which shows the errors on both the training dataset and the validation dataset at the end of each epoch of training. Example:
The validation line (orange) doesn’t drop after roughly 50 epochs while the training line continues to drop, so after 50 epochs the model keeps fitting the training dataset better but is no longer generalizing better to the validation dataset. We would want to use the model at epoch 50, not the model at a later epoch.
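If you want to produce a plot like that yourself, it’s just the per-epoch losses drawn with matplotlib. Here is a sketch where the curves are invented purely to show the shape (they are not from any real training run):

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up curves just to illustrate the shape: training loss keeps dropping,
# validation loss flattens out around epoch 50 and then creeps back up.
epochs = np.arange(1, 201)
train_losses = 1.0 / np.sqrt(epochs)
val_losses = 1.0 / np.sqrt(epochs) + 0.002 * np.maximum(0, epochs - 50)

plt.plot(epochs, train_losses, label="training loss")
plt.plot(epochs, val_losses, label="validation loss", color="orange")
plt.axvline(50, linestyle="--", color="gray", label="start of overfitting")
plt.xlabel("epoch")
plt.ylabel("loss")
plt.legend()
plt.show()
```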
Yes, and besides manually looking at a graph like the one above, there are ways to do it algorithmically using some metric. One such method is “early stopping”, which is available for both neural networks (NN) and decision tree (DT) ensembles (you will come across NN and DT in C1 and C2). Early stopping is basically a rule: if the validation metric does not improve over a certain number of rounds, the training process should stop, because more training is only helping the model overfit. You can define the metric to be anything you like, from the loss functions you see in the lecture videos to other metrics of interest such as accuracy or precision.
The links below may not help you understand early stopping any better right now, but in my opinion it’s good to know they exist, so that in the future, when you develop your own project, you can try them out yourself. A rough sketch of how each is used in code follows them.
Early stopping API for tensorflow
Early stopping option for xgboost
xgboost is a very popular decision tree ensemble package which will be talked about in C2 W4
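For a rough idea of how those two options look in code, here is a sketch with tiny made-up data (the model, data and parameter values are just placeholders, so treat it as a starting point rather than a recipe):

```python
import numpy as np
import tensorflow as tf
import xgboost as xgb

# Tiny synthetic regression dataset, split into training and validation parts.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=500)
X_train, X_val = X[:400], X[400:]
y_train, y_val = y[:400], y[400:]

# --- TensorFlow / Keras: EarlyStopping callback ---
model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")

early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss",          # watch the validation loss
    patience=10,                 # tolerate 10 epochs with no improvement
    restore_best_weights=True,   # roll back to the best epoch's weights
)
model.fit(
    X_train, y_train,
    validation_data=(X_val, y_val),
    epochs=1000,                 # an upper bound; training usually stops far earlier
    callbacks=[early_stop],
    verbose=0,
)

# --- XGBoost: early_stopping_rounds ---
dtrain = xgb.DMatrix(X_train, label=y_train)
dval = xgb.DMatrix(X_val, label=y_val)
booster = xgb.train(
    params={"objective": "reg:squarederror"},
    dtrain=dtrain,
    num_boost_round=1000,        # upper bound on the number of trees
    evals=[(dval, "validation")],
    early_stopping_rounds=10,    # stop if no improvement for 10 rounds
    verbose_eval=False,
)
print("best xgboost iteration:", booster.best_iteration)
```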
And to close out with a word of caution:
Sometimes when you plot the training error and validation error, the curves can be a little wiggly, i.e., the graph can go up slightly and then come back down, and if you zoom in you might even see this happen several times.
While it is important to avoid overfitting, and hence better to resort to early stopping, it is equally or even more important to be absolutely sure that the validation metric is really deteriorating (or has stopped improving) before we take the decision to exit. So, the first sign of a flat line or deteriorating performance should not be taken as a cue to stop training. Make sure that whatever we are seeing on the graph (or in the numbers) is not just a local aberration but a clear trend. To get this certainty we can allow the training to continue for a few more epochs beyond the warning signs, and only then stop, go back, and retrieve the model parameters from a suitable earlier epoch where the flat line or deteriorating trend clearly began.
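To make that concrete, a hand-rolled version of this “patience” idea inside a plain gradient-descent loop could look like the sketch below (my own toy example, not the lab code): keep training while remembering the parameters from the best validation epoch so far, only stop once the validation error has failed to improve for several epochs in a row, and then fall back to the remembered parameters.

```python
import numpy as np

# Toy data: simple linear regression, split into training and validation sets.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 2))
y = X @ np.array([2.0, -1.0]) + 0.5 + rng.normal(scale=0.3, size=300)
X_tr, X_val, y_tr, y_val = X[:240], X[240:], y[:240], y[240:]

def mse(w, b, X, y):
    return np.mean((X @ w + b - y) ** 2)

w, b = np.zeros(2), 0.0
alpha = 0.01          # learning rate
patience = 20         # how many "no improvement" epochs we tolerate before stopping
best_val, best_params, epochs_without_improvement = np.inf, (w.copy(), b), 0

for epoch in range(10_000):                      # upper bound, like the lab's 10,000
    # One batch gradient-descent step on the training set.
    err = X_tr @ w + b - y_tr
    w -= alpha * 2 * (X_tr.T @ err) / len(y_tr)
    b -= alpha * 2 * np.mean(err)

    val_error = mse(w, b, X_val, y_val)
    if val_error < best_val:                     # validation improved: remember these parameters
        best_val, best_params = val_error, (w.copy(), b)
        epochs_without_improvement = 0
    else:                                        # a wiggle, or the start of a real trend?
        epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            break

w, b = best_params   # go back to the parameters from the best validation epoch
print(f"stopped at epoch {epoch}, best validation MSE {best_val:.4f}")
```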
Thanks all. Very helpful.
Stephen
I realised that one of the lectures in Week 2 answers my original question perfectly and is also very specific to it. It doesn’t go into overfitting and validation sets (which have not been covered yet). The lecture is “Checking gradient descent for convergence”. Andrew shows how to use the graph to determine whether convergence has occurred, and also introduces “epsilon” in the automatic convergence test. If anyone else has the same question as me, it should be easy to understand from this lecture.
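In case it helps anyone else, my understanding is that the automatic convergence test from that lecture boils down to something like this toy sketch (my own code, not the lab’s): declare convergence once the cost decreases by less than a small epsilon between iterations.

```python
import numpy as np

# Toy data: one-feature linear regression for illustration.
rng = np.random.default_rng(2)
x = rng.uniform(0, 5, size=100)
y = 4.0 * x + 1.0 + rng.normal(scale=0.5, size=100)

def cost(w, b):
    return np.mean((w * x + b - y) ** 2) / 2

w, b = 0.0, 0.0
alpha = 0.01       # learning rate
epsilon = 1e-3     # convergence threshold, as in the automatic convergence test
prev_cost = cost(w, b)

for i in range(10_000):                       # 10,000 is just an upper bound
    err = w * x + b - y
    w -= alpha * np.mean(err * x)             # gradient of the cost w.r.t. w
    b -= alpha * np.mean(err)                 # gradient of the cost w.r.t. b

    current_cost = cost(w, b)
    if prev_cost - current_cost < epsilon:    # cost barely moved: declare convergence
        print(f"converged after {i + 1} iterations, cost {current_cost:.4f}")
        break
    prev_cost = current_cost
```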
@stesoye, that’s great, thanks for letting us know!