Do larger (but not too large) learning rates always converge faster?

Hello everyone!
I was playing around with the optional lab "Feature scaling and learning rate", and I found something interesting.
There are two learning rates in the lab that work for the given examples, and both of them converge. The interesting part was that, for a small number of iterations, the cost function for the smaller learning rate actually decreased faster. So I increased the number of iterations by 100 times, and then the cost for the larger learning rate became smaller than the cost for the smaller one.
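To show what I mean without the lab's data, here is a minimal toy sketch of the comparison (the curvatures and the learning rates 0.05 and 0.19 are made-up numbers I picked just to reproduce the behaviour, not the lab's values):

```python
import numpy as np

curv = np.array([10.0, 0.1])           # two very different curvatures, like unscaled features

def cost(w):
    # toy quadratic cost J(w) = 0.5 * sum(curv * w^2)
    return 0.5 * np.sum(curv * w**2)

def run_gd(alpha, iters):
    # plain batch gradient descent from a fixed starting point; returns the final cost
    w = np.array([1.0, 1.0])
    for _ in range(iters):
        w = w - alpha * curv * w       # gradient of J is curv * w
    return cost(w)

for iters in (10, 1000):               # a few iterations vs 100x more
    print(f"iters={iters:4d}  alpha=0.05 -> J={run_gd(0.05, iters):.2e}"
          f"   alpha=0.19 -> J={run_gd(0.19, iters):.2e}")
```

With 10 iterations the smaller learning rate gives the lower cost, but with 1000 iterations the larger one ends up lower.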
So I was wondering: if we have two learning rates that both converge, is there an example where the smaller one converges faster?

Hello @Nima1313!

Welcome to our community!! It seems you were having fun, and from your description, you had actually discovered one such example, hadn't you? It is completely possible that, given the same number of iterations, a smaller learning rate can perform better than a larger learning rate even though they both converge.

You have cleverly excluded the “too large” case, but consider, as a thought experiment, a “pretty large” learning rate that is only marginally converging (meaning that if it were a little bit larger, it would diverge). It will still have a pretty hard time reaching an optimum, because every update overshoots the minimum and only slowly closes in on it. Well, I am too lazy to actually carry out the real experiment, but again, from your description, you have already found such a case.
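Here is a tiny sketch of that thought experiment on a made-up 1-D quadratic cost J(w) = 0.5 * lam * w^2, where one gradient descent update multiplies w by (1 - alpha * lam), so it diverges once alpha exceeds 2 / lam (lam and the alphas below are just illustrative numbers):

```python
# Per-iteration shrink factor of |w| for the toy cost J(w) = 0.5 * lam * w**2:
# the update w <- w - alpha * lam * w multiplies w by (1 - alpha * lam).
lam = 1.0
for alpha in (0.5, 1.0, 1.5, 1.9, 1.99):
    factor = abs(1 - alpha * lam)      # < 1 converges, = 1 stalls, > 1 diverges
    print(f"alpha={alpha:4.2f}  shrink factor per iteration = {factor:.2f}")
```

The factor drops to 0 at alpha = 1 / lam and creeps back toward 1 as alpha approaches the divergence threshold 2 / lam, so the marginally converging case is still very slow (and it flips the sign of w on every step).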

Keep trying, and cheers,
Raymond


Hello @Nima1313

Since you are on an adventure, there is 1 more thing you could look at.

As per your experiment, the smaller learning rate case converged faster than the larger learning rate case - I assume this is what you saw.

If so, you should check whether the case with the higher learning rate is bouncing around the minimum - this can be verified by checking whether the derivative changes sign between consecutive updates/iterations.
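For instance, here is a rough sketch of that check on a made-up 1-D quadratic cost (lam, w0 and the learning rates below are just illustrative; in the lab you would record the sign of whatever gradient the lab's code computes at each iteration):

```python
import numpy as np

lam = 1.0                              # curvature of the toy cost J(w) = 0.5 * lam * w**2

def grad(w):
    # derivative of the toy cost; swap in the lab's gradient for the real check
    return lam * w

def count_sign_flips(alpha, iters=50, w0=5.0):
    # run gradient descent and count how often the derivative flips sign
    # between consecutive iterations - a symptom of bouncing across the minimum
    w, prev_g, flips = w0, None, 0
    for _ in range(iters):
        g = grad(w)
        if prev_g is not None and np.sign(g) != np.sign(prev_g):
            flips += 1
        w -= alpha * g
        prev_g = g
    return flips

for alpha in (0.5, 1.9):               # both converge for this toy cost
    print(f"alpha={alpha}: derivative changed sign {count_sign_flips(alpha)} times in 50 iterations")
```

A converging run with a large learning rate will show the sign flipping on almost every iteration, while with a small learning rate the updates approach the minimum from one side.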