I just wanted to mention an important point regarding convergence_test for the gradient descent optimization algorithm. We choose a parameter epsilon such that, if the change in J (the cost function value) falls below it, we say the algorithm has converged.
What if we choose alpha (the learning rate) to be a very small value (e.g. 0.01)? This might cause the change in cost function values to be less than epsilon even though the algorithm has not actually converged (the cost could still be, say, 10).
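This effect is easy to reproduce. Here is a minimal sketch (my own example, with made-up alpha and epsilon values, not from the course) that minimizes J(w) = w² and stops when the per-step change in cost falls below epsilon:

```python
# Minimize J(w) = w**2 (minimum J = 0 at w = 0), stopping when the
# per-step improvement in cost drops below epsilon.
def gradient_descent(alpha, epsilon, w0=10.0, max_iters=200000):
    w = w0
    J_prev = w * w
    for step in range(1, max_iters + 1):
        w -= alpha * 2 * w              # gradient of w**2 is 2w
        J = w * w
        if abs(J_prev - J) < epsilon:   # cost-change convergence test
            return J, step
        J_prev = J
    return J_prev, max_iters

# With a tiny alpha, the steps barely move, so the change in cost drops
# below epsilon while the cost itself is still far from the minimum:
print(gradient_descent(alpha=1e-4, epsilon=1e-3))   # J stops near 2.5
# With a more sensible alpha, the same epsilon stops us close to 0:
print(gradient_descent(alpha=0.1, epsilon=1e-3))
```

So with alpha = 1e-4 the test declares convergence at J ≈ 2.5 even though the true minimum is 0, which is exactly the concern raised above.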

Isn’t it better to check the change in the derivative values (or the sum of the absolute values of the derivatives)?

Looking for a minimum change in cost is very intuitive, because this number is easy for us to interpret. For example, in linear regression the cost is basically the averaged squared error. So if our true labels are in the range of 1 to 10, and we wish the prediction error to be less than 0.1, then the averaged squared error should be on the order of 0.01. We can therefore set epsilon to 1% of 0.01, which is 0.0001: if an iteration improves the cost by less than 1% of our target, that improvement is not worth more training time.
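The epsilon rule above can be restated as a couple of lines of code (the variable names are mine, not from the course):

```python
target_error = 0.1                 # desired prediction error, labels in [1, 10]
target_cost  = target_error ** 2   # cost is averaged squared error -> 0.01
epsilon      = 0.01 * target_cost  # stop when improvement < 1% of target cost
print(epsilon)                     # 0.0001 (up to float rounding)
```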

However, looking for a minimum absolute value of the derivatives is not intuitive, because we have no idea about the curvature of the cost surface around a minimum, so it is hard to intuitively select a good threshold.
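As a small illustration of the curvature problem (my own numbers, not from the reply): for J(w) = a·w² the gradient is 2aw, so a fixed gradient threshold g is first crossed at w = g / (2a), a distance from the minimum that depends entirely on the usually unknown curvature a:

```python
g = 1e-3                    # some gradient threshold we might pick
for a in (100.0, 1e-4):     # steep bowl vs. flat bowl, J(w) = a * w**2
    w_stop = g / (2 * a)    # where |dJ/dw| = |2*a*w| first drops below g
    print(a, w_stop)
# steep bowl (a = 100):  stops at w = 5e-06 (essentially at the minimum)
# flat  bowl (a = 1e-4): stops at w = 5.0   (nowhere near the minimum)
```

The same threshold means "converged" at w ≈ 5e-06 in one problem and w = 5 in the other, so there is no single good value to pick without knowing the curvature.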

Your concern about a very small learning rate \alpha is valid; however, it is also our job to choose an appropriate \alpha. Choosing \alpha and epsilon is easier than choosing a threshold for the derivatives.

The value of \frac{dJ}{dw} near the minimum could be so small that, to tell whether any change is still happening, we would need to look at significant digits far below the order of magnitude that is easily discernible to us. And yet, even at such an infinitesimal value of \frac{dJ}{dw}, there could still be a tangible change in the cost.

So, if we had used \frac{dJ}{dw} as the stopping criterion in this scenario, we would have exited training even while tangible updates to the cost were still happening, which means there was still scope for reducing the cost and hence improving the accuracy of the model.

On the other hand, directly tracking the cost rather than \frac{dJ}{dw} gives us a much better feel for whether the model is still improving or has stagnated.
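To make this concrete, here is a small sketch (my own example, with made-up thresholds, not from the thread) of a very flat cost bowl where the gradient is already tiny, yet a cost-change test still recovers almost all of the remaining cost:

```python
# A very flat bowl: J(w) = c * (w - 5)**2 with c = 1e-6.
c = 1e-6
J    = lambda w: c * (w - 5) ** 2
dJdw = lambda w: 2 * c * (w - 5)

w = 1005.0
print(J(w))      # 1.0 -> plenty of cost left to remove
print(dJdw(w))   # ~0.002 -> already below a "small" threshold like 0.01

# A gradient-threshold test (|dJ/dw| < 0.01) would stop here immediately.
# A cost-change test keeps going and removes almost all of that cost:
alpha, epsilon = 1000.0, 1e-6
J_prev = J(w)
while True:
    w -= alpha * dJdw(w)
    if abs(J_prev - J(w)) < epsilon:
        break
    J_prev = J(w)
print(J(w))      # roughly 2.5e-4, versus the J = 1.0 the gradient test accepted
```

The gradient test would have accepted a cost of 1.0; the cost-change test drives it down by more than three orders of magnitude before stopping.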