Why would you still use weighted momentum and batch normalisation together? Is weighted momentum really necessary in that case? If yes, why?
My understanding is that batch normalisation makes weighted momentum redundant: once the inputs to the hidden layers are also normalised, the scaling is equal along all axes, so there should be no need for weighted momentum.
Beware that these two mechanisms address two different problems:
- Weighted momentum makes the optimization algorithm less prone to getting stuck in local minima (and therefore more likely to reach the global minimum)
- Batch normalization addresses internal covariate shift, i.e. the variation in the distribution of each layer's inputs during training. That said, there is some controversy here: some researchers argue that this is not really the reason it works, and that it instead smooths the loss landscape (check this paper)
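To make the distinction concrete, here is a minimal sketch of the two update rules side by side, in plain Python. The function names and hyperparameter values are illustrative (momentum formulations differ slightly across frameworks, e.g. whether the gradient is scaled by `1 - beta`); this is not tied to any particular library.

```python
import math

def momentum_step(param, grad, velocity, lr=0.1, beta=0.9):
    """One SGD-with-momentum update (one common formulation).

    The velocity accumulates a running direction across steps, which is
    what helps the optimizer roll through shallow local minima.
    """
    velocity = beta * velocity + grad   # smooth the gradient over time
    param = param - lr * velocity       # move along the smoothed direction
    return param, velocity

def batch_norm(batch, gamma=1.0, beta_shift=0.0, eps=1e-5):
    """Normalize a 1-D batch of activations, then rescale.

    This standardizes the *distribution* of a layer's inputs within a
    batch; it says nothing about how parameters move between steps.
    """
    mean = sum(batch) / len(batch)
    var = sum((x - mean) ** 2 for x in batch) / len(batch)
    return [gamma * (x - mean) / math.sqrt(var + eps) + beta_shift
            for x in batch]
```

Note that momentum operates across time steps of the optimizer, while batch normalization operates across examples within a batch, which is why one does not subsume the other.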
Exactly, that was the intuition that made me ask the question. Thank you for the paper.