Hi everyone,
I have a conceptual question about Batch Normalization that I haven’t been able to find a clear answer to, either in the lectures or in common references.
As far as I understand, Batch Normalization during training:
- Computes the mean and variance per batch
- Normalizes activations to zero mean and unit variance
- Then applies learnable scale (γ) and shift (β) parameters
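To make sure I have the mechanics right, here is roughly what I have in mind for the training-time forward pass (a minimal NumPy sketch of my understanding for a 2-D (batch, features) input; running statistics, the backward pass, and the conv/channel case are left out, and the function name is just mine):

```python
import numpy as np

def batchnorm_train(x, gamma, beta, eps=1e-5):
    """Training-time BatchNorm for a 2-D (batch, features) input."""
    mu = x.mean(axis=0)                     # per-feature batch mean
    var = x.var(axis=0)                     # per-feature batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)   # zero mean, unit variance
    return gamma * x_hat + beta             # learnable scale and shift

# toy usage: 32 samples, 8 features, deliberately far from zero mean / unit variance
x = np.random.randn(32, 8) * 3.0 + 5.0
y = batchnorm_train(x, gamma=np.ones(8), beta=np.zeros(8))
```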
In practice, subtracting the batch mean seems essential for centering activations and improving optimization. However, it made me wonder about an edge case:
Has anyone ever encountered a situation where it was actually beneficial to keep the bias (mean) of the data and normalize only by the variance, i.e., skip mean subtraction? (I sketch what I mean in code after the list below.)
More concretely:
- Are there known tasks, architectures, or data distributions where preserving the mean helped?
- Or is mean subtraction in BatchNorm essentially always beneficial, with γ and β already covering any useful bias information?
- If such cases exist, are they more common in specific settings (e.g., GNNs, time-series, physics-informed models)?
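For clarity, this is the variant I am asking about (same minimal NumPy sketch as above, with only the centering step removed; the function name is just mine, not a standard layer):

```python
import numpy as np

def batchnorm_no_centering(x, gamma, beta, eps=1e-5):
    """Hypothetical variant: divide by the batch standard deviation
    but skip mean subtraction, so the (rescaled) batch mean survives."""
    var = x.var(axis=0)                  # per-feature batch variance
    x_hat = x / np.sqrt(var + eps)       # roughly unit variance, mean not removed
    return gamma * x_hat + beta          # same learnable scale and shift
```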
I’m asking mostly from an intuition and empirical-experience perspective rather than theory alone.
Thanks in advance — I’d be very interested to hear if anyone has seen this work in practice.
Behzad