In the video “Supervised Machine Learning: Regression and Classification → Week 1 → Learning rate”, 3:46
Why does the point jump away (above) when the alpha is too big? I thought it should jump around, instead of jumping away.
Ideally, each update of ‘w’ would move the cost toward the minimum.
But if the magnitude of the change in ‘w’ is too large, the cost can shoot past the minimum and climb up the opposite side of the curve. The next update then over-corrects back in the opposite direction, and so on.
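To see the difference between “jumping around” and “jumping away”, here is a tiny sketch (not course code, just a toy cost J(w) = w^2 with gradient 2w, and illustrative values of alpha): with alpha = 0.8 each step overshoots the minimum but still lands closer, so the cost keeps shrinking, while with alpha = 1.2 each step lands farther away, so the cost grows.

```python
# Toy illustration: gradient descent on J(w) = w**2, whose gradient is 2*w.
def run_gd(alpha, steps=5, w=1.0):
    history = [(w, w ** 2)]
    for _ in range(steps):
        w = w - alpha * 2 * w          # w := w - alpha * dJ/dw
        history.append((w, w ** 2))
    return history

# alpha = 0.8: w overshoots and flips sign each step, but |w| shrinks,
# so the point "jumps around" the minimum while the cost still decreases.
for w, cost in run_gd(alpha=0.8):
    print(f"w = {w:+.4f}   J(w) = {cost:.4f}")

# alpha = 1.2: each step overshoots by more than it gained, so |w| and
# the cost grow every iteration -- the point "jumps away" (diverges).
for w, cost in run_gd(alpha=1.2):
    print(f"w = {w:+.4f}   J(w) = {cost:.4f}")
```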
But it isn’t really a 2D parabolic curve, it’s high-dimensional, i.e. it isn’t bounded in every direction by a “ditch” of the same height. If the learning rate is big enough, a step might overshoot the minimum and climb right out of the ditch!
Considering just the right side of a perfect parabolic cost curve for simplicity (with the minimum at w = 0), the gradient descent update is
w := w - \alpha \frac{\partial Loss}{\partial w}
and it will not jump across the minimum as long as w - \alpha \frac{\partial Loss}{\partial w} > 0.
The update jumps across the minimum when the step \alpha \frac{\partial Loss}{\partial w} is larger than the remaining distance w, so that w - \alpha \frac{\partial Loss}{\partial w} becomes negative. That happens where the gradient is large relative to how close w already is to the minimum, and that is where a smaller value of \alpha is required to scale the step down and prevent jumps across the minimum.
There is no jump where \frac{\partial Loss}{\partial w} is small; there the updates are tiny and learning is slow, and in that scenario a larger value of \alpha is required to speed up learning.
So if we increase \alpha in either of the two scenarios, it increases the step term \alpha \frac{\partial Loss}{\partial w}, and w := w - \alpha \frac{\partial Loss}{\partial w} can become negative (i.e. the update moves past 0 to the other side of the minimum).
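As a quick numeric check of that condition, here is a toy example of my own, using the hypothetical loss Loss(w) = w^2 with its minimum at w = 0, so \frac{\partial Loss}{\partial w} = 2w. A single update crosses the minimum exactly when the step \alpha \cdot 2w exceeds w:

```python
# Check whether one update w := w - alpha * dLoss/dw crosses the minimum
# at w = 0, for the toy loss Loss(w) = w**2 (so dLoss/dw = 2*w).
def crosses_minimum(w, alpha):
    grad = 2 * w                       # dLoss/dw for Loss(w) = w**2
    w_new = w - alpha * grad
    return w_new, w_new < 0            # negative w_new means we jumped past 0

for alpha in (0.1, 0.4, 0.6, 1.2):
    w_new, crossed = crosses_minimum(w=1.0, alpha=alpha)
    print(f"alpha = {alpha}: new w = {w_new:+.2f}, crossed minimum: {crossed}")
```

For this perfectly parabolic loss the crossover depends only on \alpha (the step 2\alpha w shrinks in proportion to w), but on a loss whose steepness varies from region to region, the same \alpha can be safe in one place and overshoot in another, which is the situation described above.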
Perhaps this is how the Adam algorithm manages alpha. If \alpha is chosen carefully, Adam may not be required for simple curves/expressions, but I think it is generally better to use Adam, or something else that manages alpha, anyway.
Please correct me if I am wrong.
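For reference, here is a minimal sketch of the standard Adam update applied to the same toy loss (my own illustration, using the usual default hyperparameters \beta_1 = 0.9, \beta_2 = 0.999, \epsilon = 10^{-8}, not anything from the course). The point is just that the raw gradient is rescaled by running statistics, so the effective step size adapts even though \alpha itself stays fixed.

```python
import math

# Minimal Adam sketch on the toy loss Loss(w) = w**2 (dLoss/dw = 2*w),
# with the standard default hyperparameters.
def adam(w=1.0, alpha=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=50):
    m, v = 0.0, 0.0
    for t in range(1, steps + 1):
        g = 2 * w                               # gradient of Loss(w) = w**2
        m = beta1 * m + (1 - beta1) * g         # running mean of gradients
        v = beta2 * v + (1 - beta2) * g * g     # running mean of squared gradients
        m_hat = m / (1 - beta1 ** t)            # bias correction
        v_hat = v / (1 - beta2 ** t)
        w -= alpha * m_hat / (math.sqrt(v_hat) + eps)   # adaptive step
    return w

print(adam())   # w after 50 Adam steps, which should end up near the minimum at 0
```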
The complexity of setting a good value for alpha is why this fixed-rate method isn’t used very often.
But it’s a good first introduction to using gradients to find the minimum of a convex function.