Learning rate vs. mini-batch size in sum of losses

Hi, I don’t understand this line from the week 3 programming assignment:

Note: When using the sum of losses for gradient computation, it’s important to reduce the learning rate as the size of the mini-batch increases. This ensures that you don’t take overly large steps towards the minimum.

Why exactly would it be a problem to take large steps in this case? Is it because, if the learning rate is kept the same as for smaller mini-batches, the larger step implied by the bigger mini-batch is amplified even further, making convergence harder?

Conversely, is it a general rule of thumb that the learning rate should be increased when the batch size is smaller?

I think I am confused because the batch size and learning rate are being looked at in isolation here, when there are other factors at play as well, e.g. learning rate decay.

Thank you for your help.

Please see this

Okay, I think I got it. Since we are not averaging the loss over the mini-batch before taking a step, the gradient we use is scaled up by roughly the size of the batch. So if you have tuned a learning rate for a batch size of 5 and then switch to a batch size of 10, the summed loss (and its gradient) roughly doubles, and you would want to use a correspondingly lower learning rate.
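To make that concrete, here is a minimal sketch (using a made-up one-parameter linear model, not the assignment’s code) comparing the gradient of a summed loss with the gradient of a mean loss as the batch size grows:

```python
import numpy as np

rng = np.random.default_rng(0)

def gradients(batch_size, w=2.0):
    # Toy linear model y_hat = w * x with squared-error loss per example.
    x = rng.normal(size=batch_size)
    y = 3.0 * x                       # "true" targets
    err = w * x - y                   # per-example residuals
    grad_sum = np.sum(2 * err * x)    # gradient of the summed loss
    grad_mean = np.mean(2 * err * x)  # gradient of the mean loss
    return grad_sum, grad_mean

for m in (5, 10, 50):
    g_sum, g_mean = gradients(m)
    print(f"batch size {m:>3}: |grad of sum| = {abs(g_sum):7.2f}, "
          f"|grad of mean| = {abs(g_mean):5.2f}")

# The summed-loss gradient grows roughly in proportion to the batch size,
# while the mean-loss gradient stays on the same scale. So with the sum,
# the effective step length (learning_rate * gradient) grows with the
# batch, which is why the learning rate should shrink as the batch grows.
```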

Thank you!