It seems that the normal process of approaching a minimum would already reduce dw and db, so wouldn't that make learning rate decay unnecessary?
You're right: as we approach a minimum during training, the shrinking gradients lead to smaller parameter updates even if the learning rate stays constant. However, relying on shrinking gradients alone isn't always enough for efficient convergence, and this is where learning rate decay still helps. It provides an additional mechanism to control the step size, improving convergence stability; for example, with mini-batch gradient descent the noisy gradients can keep the parameters oscillating around the minimum at a fixed learning rate, while a decayed rate lets them settle closer to it. It's widely used because it helps models reach better minima more efficiently.
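To make this concrete, here's a minimal sketch of gradient descent on a toy quadratic using inverse-time decay (lr = lr0 / (1 + decay_rate * epoch), the schedule from the lectures, if I recall correctly). The loss function, starting point, and hyperparameter values are all made up for illustration:

```python
def inverse_time_decay(lr0, decay_rate, epoch):
    # Inverse-time decay: the learning rate shrinks as epochs pass.
    return lr0 / (1 + decay_rate * epoch)

# Toy problem: minimize f(w) = (w - 3)^2 by gradient descent.
# The gradient dw = 2*(w - 3) already shrinks near the minimum,
# but decaying the learning rate also damps late-stage oscillation.
w = 10.0
lr0, decay_rate = 0.9, 0.05
for epoch in range(50):
    dw = 2 * (w - 3)
    lr = inverse_time_decay(lr0, decay_rate, epoch)
    w -= lr * dw
print(f"final w = {w:.4f} (true minimum at w = 3)")
```

With the (deliberately large) initial rate of 0.9 the updates overshoot and oscillate around w = 3 at first, but as the rate decays the steps settle onto the minimum.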
Agree with @nadtriana.
In addition, there is research on 'learning rate schedules' where the learning rate may be increased again at later epochs; this can help the optimizer escape one local minimum and converge to another. Since we usually checkpoint the model weights after each epoch, we can recover the weights from the epoch with the best loss. This Reddit thread has some discussion.
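For a concrete picture of a schedule that raises the rate again, here's a sketch of cosine annealing with warm restarts (the SGDR schedule of Loshchilov & Hutter). The `sgdr_lr` helper and all the numbers are my own illustrative choices, not from the course:

```python
import math

def sgdr_lr(lr_min, lr_max, epoch, cycle_len):
    # Cosine annealing with warm restarts: within each cycle the lr
    # decays from lr_max toward lr_min, then jumps back to lr_max at
    # the start of the next cycle, which can kick the optimizer out
    # of a poor local minimum.
    t = epoch % cycle_len
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t / cycle_len))

for epoch in range(30):
    lr = sgdr_lr(lr_min=1e-4, lr_max=0.1, epoch=epoch, cycle_len=10)
    print(f"epoch {epoch:2d}: lr = {lr:.4f}")
    # In a real training loop you would train one epoch here and
    # checkpoint the weights whenever the validation loss improves,
    # so the best model survives any post-restart loss spike.
```

The checkpointing comment is the point made above: a restart may temporarily worsen the loss, so you keep the best weights seen so far rather than the final ones.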
Imagine a blind man who can only feel with his hands, walking and feeling his way toward the right place. Given the variety of geographical landscapes out there in real life, it's better to use caution and adjust the size of your steps based on what you feel, just so you don't fall into an abyss.
I just feel it is important to add, borrowing a little of @gent.spah's metaphor: at first you want to take big steps to get to the solution faster, but as you get closer you want to be gentler, tone it down, and take tinier and tinier steps. Otherwise you become like a bee buzzing around the hive but never quite arriving there (unless by luck).
The course programming example helped demonstrate the answer: yes, learning rate decay can help.