Repeat until convergence?

I must be missing something really obvious, but please bear with me…
At the end of Week 1 we define gradient descent as: repeat (updating w and b) until convergence.
I understand what it means, and the visualizations are very helpful, but I don’t see in the code how we test for convergence. Shouldn’t we be comparing the updated cost function J(w, b) with the previous cost function J(w, b) to ensure it is still decreasing and we did not overstep the minimum? I see in the code that we stop computing gradient descent when a fixed number of iterations is reached, but that number is arbitrary.
Surely we have to be able to do it programmatically?
Thank you!

Since this is Week 1 of an introductory course, we don’t actually test for convergence in the code; it’s done visually from the cost history plot.
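
That said, if you wanted to stop programmatically, here is a minimal sketch of the idea (the helper functions, toy data, and tolerance below are illustrative placeholders, not the lab’s actual code):

```python
import numpy as np

def compute_cost(x, y, w, b):
    # Mean squared error cost for a simple linear model f(x) = w*x + b
    m = x.shape[0]
    return np.sum((w * x + b - y) ** 2) / (2 * m)

def compute_gradient(x, y, w, b):
    # Partial derivatives of the cost with respect to w and b
    m = x.shape[0]
    err = w * x + b - y
    return np.dot(err, x) / m, np.sum(err) / m

def gradient_descent(x, y, w, b, alpha, max_iters=10_000, tol=1e-7):
    # Repeat the update until the cost stops improving (or max_iters is hit)
    prev_cost = compute_cost(x, y, w, b)
    for i in range(max_iters):
        dj_dw, dj_db = compute_gradient(x, y, w, b)
        w -= alpha * dj_dw
        b -= alpha * dj_db
        cost = compute_cost(x, y, w, b)
        if cost > prev_cost:
            print(f"Cost increased at iteration {i}; alpha may be too large")
            break
        if prev_cost - cost < tol:
            print(f"Converged after {i} iterations")
            break
        prev_cost = cost
    return w, b

# Tiny synthetic example
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 8.0])
w, b = gradient_descent(x, y, w=0.0, b=0.0, alpha=0.01)
```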

Hello @Svetlana_Verthein,

Great observation! For your information, that’s called “Early Stopping”, and it isn’t covered in this course; however, the idea is just like what you have suggested. In particular, we compare the cost on the cv set so that we “early stop” when the cv cost stops improving. TensorFlow implements this (here is the link), and I think you will want to have a look at the list of parameters you can set, such as monitor, min_delta, and patience.
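
For illustration only, here is a minimal sketch of how that callback can be wired up (the toy data, model architecture, and parameter values are placeholders, not recommendations):

```python
import numpy as np
import tensorflow as tf

# Toy data split into a training set and a cv (validation) set
x = np.linspace(0, 1, 200).reshape(-1, 1)
y = 3 * x + 1 + 0.1 * np.random.randn(200, 1)
x_train, y_train = x[:160], y[:160]
x_cv, y_cv = x[160:], y[160:]

model = tf.keras.Sequential([
    tf.keras.layers.Dense(8, activation="relu"),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")

# Stop training when the cost on the cv set stops improving
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss",   # watch the cv cost, not the training cost
    min_delta=1e-4,       # smallest decrease that still counts as an improvement
    patience=10,          # epochs to wait with no improvement before stopping
    restore_best_weights=True,
)

model.fit(x_train, y_train,
          validation_data=(x_cv, y_cv),
          epochs=500, verbose=0,
          callbacks=[early_stop])
```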

Cheers,
Raymond

Thank you, Raymond! This is very helpful. I’m in Week 2 now, and I understand why convergence cannot be tested by simply comparing the new cost J to the previous cost J (I thought that was all that was needed!)
It’s because if new J starts increasing, it may mean either: a) J minimum has been achieved or b) alpha is too large (or there is a bug in the code) - correct? Two very different scenarios.
Looking into Tensorflow EarlyStopping now - thanks for the link!

Another great point, but please let me adjust your words a little bit to the following:

> if new J stops improving, it may mean either: a) J minimum has been achieved or b) alpha is too large (or there is a bug in the code) - correct? Two very different scenarios.

I want to make 2 points:

  1. When you are in Course 2 Week 3, you will come across an idea called “splitting a data set into a training set and a cv set”. You will find out why it’s important that we evaluate our model on both the training set and the cv set under the cost function. Therefore, there WILL BE two J values, and you will learn how to use them (a small sketch follows this list).

  2. The two scenarios that you mentioned are pretty common in a model training process. (a) happens when the model successfully converges to a (local) minimum. (b) happens when the model diverges.
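
As a small preview of point 1, here is what “two J values” might look like in code (the data, split ratio, and cost function are illustrative only):

```python
import numpy as np

def mse_cost(x, y, w, b):
    # Squared error cost for a simple linear model f(x) = w*x + b
    return np.mean((w * x + b - y) ** 2) / 2

# Illustrative data, split into a training set and a cv set
x = np.linspace(0, 1, 100)
y = 2 * x + 0.5 + 0.05 * np.random.randn(100)
x_train, y_train = x[:70], y[:70]   # 70% used for training
x_cv, y_cv = x[70:], y[70:]         # 30% held out as the cv set

w, b = 2.0, 0.5                     # pretend these came from gradient descent
J_train = mse_cost(x_train, y_train, w, b)   # cost on the training set
J_cv = mse_cost(x_cv, y_cv, w, b)            # cost on the cv set
print(f"J_train = {J_train:.4f}, J_cv = {J_cv:.4f}")
```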

Keep learning!
Raymond
