In the video about Orthogonalization, Andrew said he did not use early stopping because it’s less orthogonalized.
But without early stopping, how can I find the best parameters?
When tuning hyperparameters, why shouldn't I use early stopping?
For example, with the learning rate, I can only find the best result when I use early stopping, because the best number of epochs depends on the learning rate; a fixed epoch count is useless. Or is there another strategy?
In other words: without early stopping, how do I find the best number of epochs while tuning other hyperparameters?
I think early stopping is just used to prevent the model from overfitting. You run the model, see at what epoch the overfitting starts, and then choose that epoch.
Even without early stopping, you can still find the best epoch by watching for when the validation error starts to increase. At least, that's my understanding.
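To make that concrete, here is a minimal sketch of picking the "best epoch" after the fact from a recorded validation-loss history, rather than halting training early. The loss values are made-up illustration data, not from any real run.

```python
# Hypothetical validation-loss history, one entry per epoch (illustrative numbers).
val_losses = [0.90, 0.55, 0.40, 0.33, 0.31, 0.34, 0.38, 0.45]

# The "best epoch" is where validation loss bottoms out,
# i.e. just before it starts to increase again.
best_epoch = min(range(len(val_losses)), key=lambda e: val_losses[e])

print(best_epoch)             # index of the epoch with the lowest validation loss
print(val_losses[best_epoch])
```

The point is that you can log the full history, finish training, and only then decide which checkpoint to keep, so the epoch count never has to be fixed in advance.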
Basically, the reason you don't want to early stop is that early stopping reduces training accuracy, and we normally want to approach our estimated Bayes error, which for image recognition tasks is around 0% error. So suppose you are at 96% training accuracy and you see validation error increasing, so you decide to early stop. You might do that and get a test accuracy of, say, around 94%, but it will likely not be higher than your training accuracy. And in this task, 96% is not good enough, because we know we can do better.
Hence Andrew suggests treating bias and variance as separate problems: first reduce training error, then look at bias and variance. Say you keep training until you achieve 99% training accuracy, then run the model on the test set and see 94% accuracy. This is now purely a high-variance problem, since you have already attained low bias, so you concentrate on reducing variance by trying regularization, getting more data, etc. After all of this, you might end up with 99% training accuracy and 98% test accuracy.
That is a lot better than early stopping, because you handled the two problems separately, which told you a lot about what to do next.
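The decision logic described above can be sketched as a tiny helper. The function name, thresholds, and accuracy numbers here are my own illustrative assumptions, not anything from the course:

```python
def diagnose(train_acc, test_acc, bayes_acc=1.0, tol=0.02):
    """Return which problem to attack next: 'bias', 'variance', or 'done'.

    bayes_acc is the assumed achievable (Bayes-level) accuracy;
    tol is an arbitrary tolerance for "close enough".
    """
    bias = bayes_acc - train_acc     # avoidable bias: gap from training to Bayes
    variance = train_acc - test_acc  # variance: gap from training to test
    if bias > tol:
        return "bias"      # keep training / try a bigger model first
    if variance > tol:
        return "variance"  # then regularize / get more data
    return "done"

print(diagnose(0.96, 0.94))  # still a bias problem: keep reducing training error
print(diagnose(0.99, 0.94))  # low bias, high variance: regularize / add data
print(diagnose(0.99, 0.98))  # both gaps small
```

Early stopping couples the two knobs: it trades training accuracy for a smaller train/test gap in one move, which is exactly why it is "less orthogonalized."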
I understand now that first focusing on reducing bias and then focusing on reducing variance is better.
I hadn't thought that much about reducing bias; I was too focused on preventing overfitting.