One of the formulas suggested in the videos for learning rate decay is

$$\alpha_t = 0.95^t \,\alpha_0 .$$

In this case $\sum_{t=0}^{\infty} \alpha_t$ is finite (it is a geometric series, summing to $\alpha_0 / (1 - 0.95) = 20\,\alpha_0$), so with bounded gradients the parameters can only travel a bounded total distance. Isn't that a problem? Doesn't it prevent gradient descent from ever reaching the minimum?
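For concreteness, here is a quick sketch of how fast that geometric sum saturates (the $\alpha_0 = 0.1$ is just an assumed value, not something from the course):

```python
# Quick illustration (assumed alpha_0): with alpha_t = 0.95**t * alpha_0,
# the total "step budget" converges to alpha_0 / (1 - 0.95) = 20 * alpha_0.
alpha_0 = 0.1
decay = 0.95

limit = alpha_0 / (1 - decay)  # geometric series limit = 2.0 here
partial = 0.0
for t in range(201):
    alpha_t = decay ** t * alpha_0
    partial += alpha_t
    if t in (10, 50, 100, 200):
        print(f"epoch {t:3d}: alpha_t = {alpha_t:.2e}, "
              f"sum so far = {partial:.4f} (limit {limit:.4f})")
```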
Hi, @psv.
It could, depending on the decay rate and the number of epochs you train your model for (which is, of course, finite).
If it does, just tweak your hyperparameters.
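For example, here is a rough comparison (the $\alpha_0$ and epoch count are assumed values, not course code) of how the decay rate interacts with a finite training budget:

```python
# Rough comparison (assumed alpha_0 and epoch count): a slower decay keeps the
# learning rate usable for longer and leaves a larger finite-horizon step budget.
alpha_0 = 0.1
epochs = 100
for decay in (0.90, 0.95, 0.99):
    final_alpha = decay ** epochs * alpha_0
    # sum of alpha_t for t = 0 .. epochs (finite geometric sum)
    budget = alpha_0 * (1 - decay ** (epochs + 1)) / (1 - decay)
    print(f"decay={decay}: alpha at epoch {epochs} = {final_alpha:.2e}, "
          f"total step budget = {budget:.3f}")
```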
@psv
Ditto Ramon’s response.
In Adam optimization, the main hyperparameter you would be tuning is alpha (the learning rate), and the best value changes from one dataset to another. As far as I know, there is no one-size-fits-all solution, and good values can vary by orders of magnitude!
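If it helps, here is a minimal sketch of a log-scale random search for alpha; the range [1e-4, 1e-1] and the five samples are just illustrative assumptions:

```python
import numpy as np

# Minimal sketch of a log-scale random search for the learning rate alpha;
# the search range and sample count are assumptions, not a prescription.
rng = np.random.default_rng(0)
exponents = rng.uniform(-4, -1, size=5)   # sample the exponent uniformly
candidate_alphas = 10.0 ** exponents
print(candidate_alphas)  # values spread over several orders of magnitude
# Each candidate would then be tried as the Adam learning rate and the best
# one picked on the dev set.
```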
Couldn’t agree more. Thanks, @suki