I managed to understand the batch normalization lecture, but I have some thoughts on the practical side.
Does BN help in most cases? Does its benefit depend on which activation is used (ReLU, sigmoid, tanh, …)?
My intuition is that BN helps when the activations are sigmoid or tanh but doesn't help with ReLU, because sigmoid and tanh have their maximum derivative at 0 and a derivative of approximately 0 as the input goes to -inf or +inf.
Is that correct?
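
To make the derivative part of my intuition concrete, here is a minimal sketch that evaluates each activation's derivative at 0 and far from 0. The helper names (`sigmoid_grad`, `tanh_grad`, `relu_grad`) and the sample points are my own choices for illustration, not something from the lecture.

```python
import math

def sigmoid_grad(x):
    # d/dx sigmoid(x) = sigmoid(x) * (1 - sigmoid(x))
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)

def tanh_grad(x):
    # d/dx tanh(x) = 1 - tanh(x)^2
    return 1.0 - math.tanh(x) ** 2

def relu_grad(x):
    # d/dx relu(x) is 1 for positive inputs and 0 for negative inputs
    # (the value at exactly 0 is a convention; 0 is used here)
    return 1.0 if x > 0 else 0.0

# Sample points are arbitrary: one near the center, two far into the tails.
for x in (-10.0, 0.0, 10.0):
    print(
        f"x={x:+.0f}  "
        f"sigmoid'={sigmoid_grad(x):.5f}  "
        f"tanh'={tanh_grad(x):.5f}  "
        f"relu'={relu_grad(x):.1f}"
    )
```

This only shows the derivatives themselves: sigmoid and tanh peak at 0 and flatten on both sides, while ReLU keeps a derivative of 1 for any positive input and 0 for any negative input. It says nothing on its own about whether BN helps during training.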