The Adam optimizer already adapts its step sizes, which should dampen noisy updates during convergence. So why do we still perform learning rate decay?
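For concreteness, here is a minimal sketch of what I mean by combining Adam with learning rate decay (assuming PyTorch and a StepLR schedule, which are just my choices for illustration, not something from the quoted source):

```python
import torch
import torch.nn as nn

# A tiny model just to have parameters to optimize (placeholder for a real network).
model = nn.Linear(10, 1)

# Adam adapts per-parameter step sizes, but the global learning rate
# is still controlled by the `lr` argument.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Learning rate decay on top of Adam: multiply the learning rate by
# `gamma` every `step_size` epochs.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)

for epoch in range(90):
    # Stand-in batch; in practice this would come from a data loader.
    x, y = torch.randn(32, 10), torch.randn(32, 1)
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(x), y)
    loss.backward()
    optimizer.step()
    scheduler.step()  # apply the decay schedule once per epoch
```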
The common belief is that decaying the learning rate helps the network converge to a local minimum and avoid oscillations. However, experiments suggest that this belief is insufficient to explain the general effectiveness of lrDecay in training modern neural networks that are deep, wide, and nonconvex. We provide another novel explanation: an initially large learning rate suppresses the network from memorizing noisy data, while decaying the learning rate improves the learning of complex patterns.
Source: