Z Norm Calculation Question

Why is the z_norm calculation done as

(z(i) - mean) / (sqrt(stddev ** 2 + epsilon))

If epsilon is just being used to avoid a pole, wouldn’t it be much simpler computationally to do

(z(i) - mean) / (abs(stddev) + epsilon)
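For concreteness, here is a minimal sketch (plain NumPy; the function names `z_norm_sqrt` and `z_norm_abs` are made up for this comparison) showing that, for a non-constant column, the two variants give nearly identical results:

```python
import numpy as np

def z_norm_sqrt(z, epsilon=1e-8):
    # Lecture-style variant: divide by sqrt(variance + epsilon)
    return (z - z.mean()) / np.sqrt(z.var() + epsilon)

def z_norm_abs(z, epsilon=1e-8):
    # Proposed variant: divide by abs(stddev) + epsilon
    return (z - z.mean()) / (np.abs(z.std()) + epsilon)

z = np.array([1.0, 2.0, 3.0, 4.0])
# For a non-constant column the two variants differ only by a tiny amount
print(np.max(np.abs(z_norm_sqrt(z) - z_norm_abs(z))))
```

The difference only becomes noticeable when stddev is close to zero, which is exactly the case epsilon is guarding against.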

In maths books they generally use the square root, right? I am not sure whether abs is less computationally expensive than sqrt! But in any case the resources used are O(1).

The square → sqrt method should be much costlier in CPU instructions.

abs will be O(1), but depending on the algorithm you’re using, square and sqrt will cost at best roughly O(n^1.5) where n is the number of digits in stddev.


Hi @Aaron_Peschel

Where do you see this formula? I think the key is that the standard deviation is never negative, so there is no need for abs or squaring.

Raymond

Just a side comment:

Personally, I will not always add epsilon there, because if stddev is zero, it means the column is constant, and a constant column is useless. I will let it return a null value so that I can detect it easily, either from the error or with my checking code.

However, I will add it if I need the computation to run without errors.

Of course, we can always still detect it even with the epsilon there.

So, it depends.
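As a minimal sketch of that detection idea (not code from the course; the epsilon-free helper below is hypothetical), dividing by a zero standard deviation without epsilon produces NaN values that are easy to flag:

```python
import numpy as np

def z_norm_no_eps(z):
    # No epsilon: a constant column divides by zero and produces NaN
    return (z - z.mean()) / z.std()

constant_col = np.array([5.0, 5.0, 5.0])
normalized = z_norm_no_eps(constant_col)   # emits a RuntimeWarning (0 / 0)
print(np.isnan(normalized).any())          # True -> the column can be flagged as useless
```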

Hello rmwkwow, this is the video with the formula I mentioned:

The formula is at approximately 4:08.

I realized after watching it again that the reason the calculation is done as shown in the video is that the previous step computes the variance (sigma ** 2), not the stddev itself.
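In other words, the steps in the video never form the standard deviation on its own, so the epsilon naturally goes under the square root. Here is a rough sketch of that ordering, assuming a plain NumPy batch-norm forward pass with made-up variable names:

```python
import numpy as np

def batch_norm_forward(Z, gamma, beta, epsilon=1e-8):
    mu = Z.mean(axis=0)                          # batch mean
    var = ((Z - mu) ** 2).mean(axis=0)           # batch variance, i.e. sigma ** 2
    Z_norm = (Z - mu) / np.sqrt(var + epsilon)   # normalize straight from the variance
    return gamma * Z_norm + beta                 # scale and shift

Z = np.random.randn(32, 4)                       # 32 examples, 4 hidden units
out = batch_norm_forward(Z, gamma=np.ones(4), beta=np.zeros(4))
```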

I see, @Aaron_Peschel. In the context of batch norm, what I said becomes irrelevant. :wink:

To add something relevant: yes, I agree with your latest point, and it is important for the slide to show the square root and square explicitly, because they are necessary calculation steps. We are discussing a neural network process, and neural networks use backpropagation. For backpropagation to work, we need to keep track of the calculation steps, including the square root and square, in order to compute the gradients.
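As a tiny worked example of that dependence (the numbers are made up), the backward pass through the normalization reuses the square root computed and cached in the forward pass:

```python
import numpy as np

epsilon = 1e-8
var = 2.0                              # pretend this came from the forward pass
denom = np.sqrt(var + epsilon)         # forward step that must be cached

# d/d(var) of sqrt(var + epsilon) = 1 / (2 * sqrt(var + epsilon)),
# so the gradient reuses the cached square root
d_denom_d_var = 1.0 / (2.0 * denom)
print(d_denom_d_var)
```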

Epsilon is completely necessary there because this is inside a neural network. A layer can use ReLU, and ReLU can make every value in a batch zero, which leads to a zero variance.

Epsilon sets a lower bound on the variance, and thus an upper bound on the reciprocal of the standard deviation. An upper bound is good for numerical stability.
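A small sketch of that failure mode, with made-up values: a batch whose pre-activations are all negative comes out of ReLU as all zeros, the variance collapses to zero, and epsilon keeps the scaling factor 1 / sqrt(var + epsilon) finite.

```python
import numpy as np

epsilon = 1e-8
pre_activations = np.array([-1.3, -0.2, -4.0, -0.7])   # all negative for this batch
activations = np.maximum(0.0, pre_activations)          # ReLU zeroes everything out
var = activations.var()                                  # 0.0
scale = 1.0 / np.sqrt(var + epsilon)                     # capped at 1 / sqrt(epsilon) = 1e4
print(scale)
```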

Cheers,
Raymond
