Hi, I don’t understand this line from the week 3 programming assignment:
"Note: When using the sum of losses for gradient computation, it's important to reduce the learning rate as the size of the mini-batch increases. This ensures that you don't take large steps towards the minimum."
Why exactly would large steps be a problem in this case? Is the idea that, with the LR kept the same as for smaller mini-batches, summing the losses makes the gradient (and hence the step size) grow roughly in proportion to the batch size, so the updates overshoot and convergence becomes harder?
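To check my own understanding, I wrote a small toy example (everything here is my own sketch, a least-squares model in NumPy, not from the assignment) to see whether the sum-of-losses gradient really grows with the batch size:

```python
import numpy as np

rng = np.random.default_rng(0)
w = np.zeros(3)                       # toy model weights
X = rng.normal(size=(1024, 3))        # toy inputs
y = X @ np.array([1.0, -2.0, 0.5])    # toy targets

def grad_sum_of_losses(w, Xb, yb):
    # Gradient of the SUM (not the mean) of squared errors over the
    # mini-batch: every example contributes its full gradient, so the
    # total grows roughly linearly with the batch size.
    residual = Xb @ w - yb
    return 2.0 * Xb.T @ residual

for B in (32, 128, 512):
    g = grad_sum_of_losses(w, X[:B], y[:B])
    print(f"batch size {B:4d}: |grad| = {np.linalg.norm(g):8.1f}, "
          f"|grad|/B = {np.linalg.norm(g) / B:.3f}")
```

When I run this, |grad| does grow roughly in proportion to B while |grad|/B stays about constant, which (if I understand correctly) is why the LR has to shrink as the batch gets bigger: it compensates for the gradient scaling that the mean reduction would otherwise have absorbed.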
Conversely, is it a general rule of thumb that with a smaller batch size the learning rate should be increased?
I think what confuses me is looking at the batch size and LR parameters in isolation, given that other factors, e.g. LR decay, also interact with them.
Thank you for your help.