In this question, it says that batch normalization can't be used with stochastic (single-example) gradient descent. Did the professor mention that in the lectures? It also made me wonder whether there are other similar restrictions, e.g., can momentum only be applied to mini-batch gradient descent, or can RMSprop and Adam not be used with stochastic gradient descent? Could anyone help me systematize which optimizers can be applied to which variant of gradient descent?
Hi, @1157350959.
I don’t think it was explicitly mentioned in the lectures.
Batch Normalization computes the mini-batch mean and variance for each activation, and that doesn’t make much sense if there’s a single example per mini-batch, does it?
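Here's a minimal NumPy sketch to make that concrete (the function name `batch_norm_forward` and the `eps` value are just illustrative, not the course's exact implementation). With a batch of one example, the batch mean equals the example itself and the batch variance is zero, so the normalized activations collapse to all zeros and the output is just `beta`:

```python
import numpy as np

def batch_norm_forward(z, gamma, beta, eps=1e-8):
    """Normalize activations z (shape: batch_size x units) with batch statistics."""
    mu = z.mean(axis=0)                   # per-unit mean over the mini-batch
    var = z.var(axis=0)                   # per-unit variance over the mini-batch
    z_norm = (z - mu) / np.sqrt(var + eps)
    return gamma * z_norm + beta

gamma, beta = np.ones(3), np.zeros(3)

# Mini-batch of 4 examples: the statistics are meaningful.
z_batch = np.random.randn(4, 3)
print(batch_norm_forward(z_batch, gamma, beta))

# "Mini-batch" of 1 example: mean == the example, variance == 0,
# so z_norm is all zeros and the output is just beta -- the activation
# values are wiped out, which is why BN breaks at batch size 1.
z_single = np.random.randn(1, 3)
print(batch_norm_forward(z_single, gamma, beta))  # -> all zeros (== beta)
```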
As for the other optimizers (momentum, RMSprop, Adam), batch size certainly affects their performance, but I don't think there are any hard restrictions like that: their update rules only need a gradient, whatever batch size it was computed from. See the sketch below.
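To illustrate, here's a small sketch (the helper `momentum_step` and the toy quadratic objective are just assumptions for illustration) showing that a gradient-descent-with-momentum update works fine when each step uses the gradient of a single example:

```python
import numpy as np

def momentum_step(w, grad, v, lr=0.01, beta=0.9):
    """One gradient-descent-with-momentum update; works for any batch size."""
    v = beta * v + (1 - beta) * grad   # exponentially weighted average of gradients
    w = w - lr * v
    return w, v

# Toy objective: f(w) = 0.5 * ||w||^2, so the true gradient is w itself.
w = np.array([2.0, -3.0])
v = np.zeros_like(w)

# "Single-example" updates: each step uses one noisy gradient, and the
# momentum average still smooths them out -- nothing in the update rule
# requires a mini-batch.
for step in range(100):
    noisy_grad = w + 0.1 * np.random.randn(*w.shape)  # gradient from one example
    w, v = momentum_step(w, noisy_grad, v)

print(w)  # close to the optimum [0, 0]
```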
Good luck with the rest of week 3!