Has anyone ever considered making the learning rate alpha adaptive to optimise convergence of the cost function to a global minimum?
It's occurred to me that the initial values of the parameter vector w could be determined by first finding the w that makes the cost function a maximum, and then setting alpha to a high value like 0.8. As the cost function converges quickly towards the global minimum with the large alpha, compute the gradient vector at each step and reduce alpha as the magnitude of the gradient gets smaller. This way, the cost function should reach its global minimum as quickly as possible, without overshooting the minimum or taking too many iterations.
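To make the idea concrete, here is a minimal sketch of what I have in mind: the step size shrinks in proportion to how much the gradient norm has shrunk since the start. The function names, the `alpha_min` floor, and the quadratic example are purely illustrative, not from any library.

```python
import numpy as np

def adaptive_gradient_descent(grad, w0, alpha0=0.8, alpha_min=1e-3,
                              tol=1e-6, max_iter=10000):
    """Toy gradient descent that scales alpha down as the gradient
    magnitude gets smaller (the scheme described above)."""
    w = np.asarray(w0, dtype=float)
    g0_norm = np.linalg.norm(grad(w))        # gradient norm at the starting point
    for _ in range(max_iter):
        g = grad(w)
        g_norm = np.linalg.norm(g)
        if g_norm < tol:                      # close enough to the minimum
            break
        # reduce alpha in proportion to how much the gradient has shrunk,
        # but keep a small floor so the updates never stall completely
        alpha = max(alpha_min, alpha0 * g_norm / g0_norm)
        w = w - alpha * g
    return w

# Example: minimise the convex cost J(w) = ||w - 3||^2, whose gradient is 2*(w - 3)
w_opt = adaptive_gradient_descent(lambda w: 2 * (w - 3.0), w0=np.zeros(2))
print(w_opt)  # should end up close to [3., 3.]
```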
Most APIs like scikit-learn that implement linear/logistic regression allow you to set an initial learning rate and to tweak the learning rate "schedule" for gradient descent. Your intuition is correct for convex loss functions with a distinct optimal solution: initially the weights are far away from the optimal weights, so the weight updates can take us closer to the optimum if the learning rate is set to a higher value (as long as you don't overshoot the optimum after the first weight update). However, the intuition is not valid if the loss (as a function of the weights) behaves differently. There are other techniques, like momentum, that perform better on not-so-well-behaved loss functions.
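For instance, scikit-learn's `SGDRegressor` exposes `eta0` (the initial learning rate) and a `learning_rate` schedule parameter: `'invscaling'` decays the rate over time, while `'adaptive'` keeps it constant until the training loss stops improving and then divides it by 5, which is close in spirit to what you describe. A minimal sketch, using made-up synthetic data for illustration:

```python
from sklearn.linear_model import SGDRegressor
from sklearn.datasets import make_regression
from sklearn.preprocessing import StandardScaler

# synthetic regression data; SGD is sensitive to feature scale, so standardise
X, y = make_regression(n_samples=500, n_features=5, noise=10.0, random_state=0)
X = StandardScaler().fit_transform(X)

# start with a relatively large learning rate and let the 'adaptive'
# schedule cut it down whenever the training loss stops improving
model = SGDRegressor(learning_rate="adaptive", eta0=0.1, max_iter=1000, tol=1e-4)
model.fit(X, y)
print(model.coef_)
```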