Andrew mentions that one of the downsides of early stopping is that it couples two tasks, optimization and regularization, so that neither can be done well independently.

The argument makes sense. But I’m curious: since overfitting is measured by the gap between the dev error and the train error, if early stopping finds the point with the smallest gap between the two, isn’t that exactly what we want to achieve?

It’s an interesting set of issues and there is no “silver bullet” or “one size fits all” answer. I think Prof Ng’s point here is that if you are trying to tune your regularization, then early stopping is manipulating a different hyperparameter. In other words, the number of iterations and the \lambda value (assuming L2 regularization) are separate hyperparameters, but they are no longer orthogonal once you use early stopping. Note also that convergence is not guaranteed to be monotonic, so there might be an even better solution further out in terms of number of iterations that you would miss: things can diverge for a bit and then converge to an even better solution. The shapes of the cost surfaces here are pretty complex. We will learn more sophisticated techniques like Adam, RMSprop, and learning rate decay later this week.
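To see why non-monotonic convergence matters for early stopping, here is a minimal sketch of a patience-based stopping rule applied to a toy dev-error curve. The `early_stop` helper and the numbers are invented for illustration; frameworks like Keras expose the same idea via an `EarlyStopping` callback with a `patience` argument.

```python
# Illustrative sketch (hypothetical helper, made-up numbers): early
# stopping with a "patience" window on a dev-error curve that worsens
# briefly before reaching a better minimum.

def early_stop(dev_errors, patience):
    """Return (best_epoch, best_error) under early stopping.

    Training halts once the dev error has failed to improve for
    `patience` consecutive epochs.
    """
    best_epoch, best_error = 0, dev_errors[0]
    waited = 0
    for epoch, err in enumerate(dev_errors[1:], start=1):
        if err < best_error:
            best_epoch, best_error = epoch, err
            waited = 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best_epoch, best_error

# Dev error dips at epoch 2, "diverges for a bit", then settles lower.
curve = [0.50, 0.40, 0.35, 0.38, 0.41, 0.33, 0.30, 0.31]

print(early_stop(curve, patience=1))  # stops at the first dip: (2, 0.35)
print(early_stop(curve, patience=3))  # survives the bump:     (6, 0.30)
```

With a small patience the rule locks in the first local minimum at epoch 2 and misses the better solution at epoch 6, which is exactly the risk described above.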


I see the point now, thank you.