Can Variance-Only Normalization Ever Outperform Standard Batch Normalization?

Hi everyone,

I have a conceptual question about Batch Normalization that I haven’t been able to find a clear answer to, either in the lectures or in common references.

As far as I understand, Batch Normalization during training does the following (a minimal sketch follows this list):

  • Computes the mean and variance per batch

  • Normalizes activations to zero mean and unit variance

  • Then applies learnable scale (γ) and shift (β) parameters
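
Just so we're talking about the same computation, here is a minimal NumPy sketch of the training-time step as I understand it (the eps value, shapes, and function name are just my own illustration, not any particular library's API):

```python
import numpy as np

def batchnorm_train(x, gamma, beta, eps=1e-5):
    """Standard BN over a batch: x has shape (batch, features)."""
    mu = x.mean(axis=0)                     # per-feature batch mean
    var = x.var(axis=0)                     # per-feature batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)   # zero mean, unit variance
    return gamma * x_hat + beta             # learnable scale and shift

# Toy usage
x = np.random.randn(32, 8) * 3.0 + 5.0      # batch with nonzero mean and non-unit variance
gamma, beta = np.ones(8), np.zeros(8)
y = batchnorm_train(x, gamma, beta)         # ~zero mean, ~unit variance per feature
```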

In practice, subtracting the batch mean seems essential for centering activations and improving optimization. However, this made me wonder about an edge case:

Has anyone ever encountered a situation where it was actually beneficial to keep the bias (mean) of the data and only normalize by the variance (i.e., skip mean subtraction)?
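
To pin down what I mean by "variance-only", the variant would look something like this (again just a toy sketch under my own assumptions, dividing by the batch standard deviation but skipping the mean subtraction):

```python
import numpy as np

def varnorm_train(x, gamma, beta, eps=1e-5):
    """Variance-only normalization: rescale by the batch std, keep the mean."""
    var = x.var(axis=0)                     # variance still computed around the batch mean
    x_hat = x / np.sqrt(var + eps)          # no mean subtraction: the (rescaled) batch mean is kept
    return gamma * x_hat + beta             # same learnable scale and shift as standard BN
```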

More concretely:

  • Are there known tasks, architectures, or data distributions where preserving the mean helped?

  • Or is mean subtraction in BatchNorm essentially always beneficial, with γ and β already covering any useful bias information?

  • If such cases exist, are they more common in specific settings (e.g., GNNs, time-series, physics-informed models)?

I’m asking mostly from an intuition and empirical-experience perspective rather than theory alone.

Thanks in advance — I’d be very interested to hear if anyone has seen this work in practice.

Behzad

ResNet is one of the architectures where scaling by the variance alone is sufficient for high performance.

Variance-only normalization is also typically used instead of standard BN in cases where the mean is meaningful rather than noise, e.g., heterogeneous data where the mean carries important information about the data distribution.
