Hi, it looks like regularization tries to keep w small, and to achieve that we add an extra term to the cost function, as below.
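To be concrete, I mean an L2 penalty of roughly this form (my own notation, which may differ slightly from the course slides):

$$J(\vec{w}, b) = \frac{1}{2m}\sum_{i=1}^{m}\left(f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)}\right)^2 + \frac{\lambda}{2m}\sum_{j=1}^{n} w_j^2$$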
So my question is: can we achieve the same effect by increasing the learning rate during gradient descent instead of changing the cost function?
This course is brilliant, thanks for your effort!
Can you elaborate more on how increasing the learning rate can keep w small?
Hi @rmwkwok, I think that is just a hypothesis; I haven't verified it and it may not be correct. Also, increasing the learning rate may cause overshooting so that gradient descent never converges. But from what I can see from the formula, w would decrease or increase by a larger amount at each step if the learning rate were larger. So what I am really trying to understand is why the extra term added to the cost function helps keep w small. There might be a mathematical explanation, and I would very much appreciate it if you could help me understand it in any way.
A hypothesis is fine, but you can show us how you came up with it, right? No one else can prove your hypothesis for you.
Now, I assume the below is your "how", and it is what we can discuss:
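That is, the plain gradient-descent update (my sketch of what I take you to mean; symbols may differ from your post):

$$w_j := w_j - \alpha \frac{\partial J}{\partial w_j}$$

where a larger learning rate $\alpha$ simply makes the subtracted term larger.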
The fact is, it only shows that a larger learning rate subtracts a larger value from w, which is not the same as "shrinking" w.
For example, if w = 0.1, you might subtract 1,000 from it because the learning rate is large, resulting in w = -999.9.
Is -999.9 small? It is not.
For it to qualify as small, |w| must be small. In other words, -999.9 is as "large" as 999.9 in terms of magnitude.
Therefore, a large learning rate does not necessarily shrink w.
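A quick numerical sketch of this point (illustrative numbers only, not course code): a big plain gradient step can blow |w| up, while the extra gradient contributed by the L2 term multiplies w by a factor below 1 each step, pulling |w| toward zero.

```python
def plain_step(w, grad, lr):
    """Ordinary gradient descent update: w := w - lr * grad."""
    return w - lr * grad

def regularized_step(w, grad, lr, lam, m):
    """Update including the L2 term's extra gradient (lam/m) * w."""
    return w - lr * (grad + (lam / m) * w)

# A large learning rate can overshoot: w = 0.1 minus a step of
# 1,000 lands at -999.9, whose magnitude |w| is huge, not small.
w = plain_step(0.1, grad=100.0, lr=10.0)
print(w)  # -999.9

# The L2 term instead scales w by (1 - lr*lam/m) < 1 every step,
# shrinking |w| toward 0 even when the data gradient is zero.
w = 0.1
for _ in range(100):
    w = regularized_step(w, grad=0.0, lr=0.1, lam=1.0, m=1)
print(abs(w) < 0.001)  # True: |w| has shrunk
```

So the penalty term shrinks the magnitude |w| multiplicatively, whereas the learning rate only rescales whatever step the gradient dictates.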
That explains it very well, thanks for your quick response!
You are welcome, @David_Long.