Do the learnable parameters of batch normalization increase training time, since they add more parameters for gradient descent (or the optimizer in general) to update?
Please read this paper
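For a rough sense of scale, here is a minimal sketch that counts learnable parameters only. It assumes a plain fully connected layer followed by batch normalization with one scale (gamma) and one shift (beta) per feature. The function names and the 512-unit layer size are hypothetical, chosen just for illustration. The running mean and variance are not counted because they are not updated by gradient descent.

```python
def dense_param_count(in_features: int, out_features: int) -> int:
    """Learnable parameters of a fully connected layer: weights plus biases."""
    return in_features * out_features + out_features


def batchnorm_param_count(num_features: int) -> int:
    """Learnable parameters of batch norm: one gamma and one beta per feature."""
    return 2 * num_features


def batchnorm_param_fraction(in_features: int, out_features: int) -> float:
    """Batch norm parameters relative to the dense layer they follow."""
    dense = dense_param_count(in_features, out_features)
    bn = batchnorm_param_count(out_features)
    return bn / dense


if __name__ == "__main__":
    # Hypothetical layer size, picked only to make the comparison concrete.
    n_in, n_out = 512, 512
    print("dense params:     ", dense_param_count(n_in, n_out))
    print("batch norm params:", batchnorm_param_count(n_out))
    print("fraction:         ", batchnorm_param_fraction(n_in, n_out))
```

Under these assumptions the extra parameters are a very small share of what the optimizer updates. Note that this only counts parameters: the per-batch mean and variance computation in the forward and backward pass is a separate cost that this sketch does not measure.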