Why does the assignment say that ‘BatchNorm is normalizing the channels axis’? The documentation states that we usually normalize over the features axis.
My second question: I understand the concept of normalization, but how does normalizing over only the channels axis differ from normalizing over all axes, or over just the m axis (the training examples axis)?
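To make the comparison concrete, here is a minimal NumPy sketch of the three options as I understand them. The batch shape, the `normalize` helper, and the channels-last layout are my own assumptions for illustration, not taken from the assignment. It also leaves out the learned scale and shift parameters and the running averages that a real BatchNorm layer keeps.

```python
import numpy as np

def normalize(x, axes, eps=1e-5):
    """Subtract the mean and divide by the std, both computed over `axes`."""
    mean = x.mean(axis=axes, keepdims=True)
    var = x.var(axis=axes, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
# Hypothetical batch: m=8 examples, 4x4 spatial grid, 3 channels (channels-last).
x = rng.normal(loc=5.0, scale=2.0, size=(8, 4, 4, 3))

# (a) My reading of "normalizing the channels axis": one mean/variance per
#     channel, computed over the m, height and width axes.
per_channel = normalize(x, axes=(0, 1, 2))

# (b) Over the m axis only: one mean/variance per (height, width, channel) position.
per_position = normalize(x, axes=(0,))

# (c) Over all axes: a single scalar mean/variance for the whole batch.
global_norm = normalize(x, axes=(0, 1, 2, 3))

# Number of separate statistics each option computes.
print("per channel :", x.mean(axis=(0, 1, 2)).size)
print("per position:", x.mean(axis=0).size)
print("global      :", x.mean().size)
```

If this reading is right, option (a) treats every spatial position of a channel as another sample of the same feature, which would make the channels axis the "features axis" for a convolutional layer.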
Thank you!