Why is the z_norm calculation done as

```
(z(i) - mean) / (sqrt(stddev ** 2 + epsilon))
```

If epsilon is just being used to avoid a pole, wouldn’t it be much simpler computationally to do

```
(z(i) - mean) / (abs(stddev) + epsilon)
```

In maths books they generally use the square root, right? I am not sure whether abs is less computationally expensive than sqrt! But in any case the resources used are O(1).

The square → sqrt method should be much costlier in cpu instructions.

`abs` will be O(1), but depending on the algorithm you’re using, square and sqrt will cost at best roughly O(n^1.5), where n is the number of digits in stddev.

Where do you see this formula? I think the key is that the standard deviation is never negative, so there is no need to take the abs or to square it.

Raymond

Just a side comment:

Personally, I will not always add epsilon there, because if stddev is zero, the column is constant, which means it is useless. I let it return a null value so that I can detect it easily, either from the resulting error or through my checking code.

However, I will add it if I need no errors.

Of course, we can always detect it with the epsilon there.

So, it depends.
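To make the trade-off concrete, here is a minimal sketch in plain Python. The function name `standardize_column` and the convention of passing `eps=None` to mean "no epsilon" are assumptions made up for illustration, not something from the course material.

```python
import math

def standardize_column(col, eps=None):
    """Standardize a list of floats (hypothetical helper).

    With eps=None, a constant column returns None so the caller can detect it.
    With a numeric eps, a constant column silently becomes all zeros.
    """
    mean = sum(col) / len(col)
    var = sum((x - mean) ** 2 for x in col) / len(col)
    if eps is None:
        # Compare min and max rather than testing var == 0.0, because
        # floating-point rounding in the mean can leave a tiny nonzero variance.
        if max(col) == min(col):
            return None
        return [(x - mean) / math.sqrt(var) for x in col]
    return [(x - mean) / math.sqrt(var + eps) for x in col]

no_eps = standardize_column([3.0, 3.0, 3.0])              # constant column is flagged
with_eps = standardize_column([3.0, 3.0, 3.0], eps=1e-8)  # constant column is hidden
```

Without epsilon the constant column is flagged explicitly; with epsilon the function never fails, but the caller has to check for an all-zero result to notice the constant column.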

Hello rmwkwow, this is the video with the formula I mention:

formula is at approximately 4:08

After watching it again, I realize the reason the calculation is done as shown in the video: the previous step computes the variance (sigma ** 2), not the stddev.
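A minimal sketch of that ordering in plain Python may help. The name `batch_norm_1d` and the default `eps=1e-5` are assumptions for illustration and are not taken from the video.

```python
import math

def batch_norm_1d(z, eps=1e-5):
    """Normalize a list of floats, computing the variance first."""
    m = len(z)
    mean = sum(z) / m
    # The variance (sigma ** 2) is what the previous step already produces.
    var = sum((zi - mean) ** 2 for zi in z) / m
    # Epsilon is added to the variance, then a single sqrt is taken per batch.
    denom = math.sqrt(var + eps)
    return [(zi - mean) / denom for zi in z]

z_norm = batch_norm_1d([1.0, 2.0, 3.0, 4.0])
```

Since the variance is already available, writing the denominator as `sqrt(var + eps)` avoids forming the stddev as a separate intermediate value.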

I see, @Aaron_Peschel. In the context of batch norm, what I said becomes irrelevant.

To add something relevant: yes, I agree with your latest point. It is also important for the slide to show the square and the square root explicitly, because they are necessary calculation steps. We are discussing a neural network, and a neural network is trained with back propagation. For back propagation to work, we need to keep track of every calculation step, including the square and the square root, in order to compute the gradients.

Epsilon is completely necessary there because it is in a neural network. A neural network layer can use ReLU, and ReLU can make every value zero and that leads to a zero variance.

Epsilon sets a lower bound on the variance term, and thus an upper bound on the reciprocal of the standard deviation. An upper bound is good for stability.
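As a small numeric illustration of that bound (the helper name `inv_std` and the value `eps = 1e-5` are assumptions chosen for the example):

```python
import math

def inv_std(var, eps=1e-5):
    """Reciprocal of the epsilon-stabilized standard deviation."""
    return 1.0 / math.sqrt(var + eps)

eps = 1e-5
# The largest possible value, reached when the variance is exactly zero.
upper_bound = 1.0 / math.sqrt(eps)

# For any non-negative variance, the scale factor never exceeds the bound.
scales = [inv_std(var, eps) for var in (0.0, 1e-8, 1e-3, 1.0)]
```

Because `var + eps` is never smaller than `eps`, the factor that multiplies `(z(i) - mean)` can never exceed `1 / sqrt(eps)`, even when ReLU has zeroed out the whole batch.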

Cheers,

Raymond
