I’m a bit confused by the notation in this slide. I thought the normalization would be carried out separately for each training example (so m means, m variances). But the summation variable i seems to run over the training examples. Do we calculate the mean and variance across all training examples together and use those values?

Hi, @bgoyal.

The goal is to normalize each feature independently. A single example gives you only one observation per feature, which is not enough to estimate a mean or a variance. So yes, you use whole mini-batches to compute the per-feature means and variances during training (and maintain moving averages of them that are used during inference).

Here are all the details in case you’re interested
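
As a rough sketch of the per-feature computation described above (not the course's or any library's implementation): the function name `batch_norm_train`, the `momentum` value, and the `eps` value are all assumptions chosen for illustration, and the learnable scale and shift parameters are left out.

```python
import math


def batch_norm_train(batch, running_mean, running_var, momentum=0.9, eps=1e-5):
    """Normalize each feature over a mini-batch and update running statistics.

    batch: list of m examples, each a list of n feature values.
    momentum and eps are illustrative defaults, not prescribed values.
    """
    m = len(batch)
    n = len(batch[0])
    # One mean and one (biased) variance per feature, computed across the m examples.
    mean = [sum(x[j] for x in batch) / m for j in range(n)]
    var = [sum((x[j] - mean[j]) ** 2 for x in batch) / m for j in range(n)]
    normalized = [
        [(x[j] - mean[j]) / math.sqrt(var[j] + eps) for j in range(n)]
        for x in batch
    ]
    # Exponential moving averages, kept for use at inference time.
    new_mean = [momentum * rm + (1 - momentum) * mu
                for rm, mu in zip(running_mean, mean)]
    new_var = [momentum * rv + (1 - momentum) * v
               for rv, v in zip(running_var, var)]
    return normalized, new_mean, new_var


batch = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]  # m = 3 examples, n = 2 features
normalized, running_mean, running_var = batch_norm_train(
    batch, running_mean=[0.0, 0.0], running_var=[1.0, 1.0]
)
```

Note that there are n means and n variances (one per feature), not m: the sum over i runs across the examples in the mini-batch.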


Is there a difference between using $\sqrt{\sigma^2 + \epsilon}$ vs. $\sqrt{\sigma^2} + \epsilon$ similar to the Adam optimizer?

Hi, @AbhijeetKrishnan.

Both avoid division by zero, but I don’t know whether one approach works better in this particular context. Whichever form you choose, it is important to use it consistently.
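
To get a feel for how the two denominators compare numerically, here is a small sketch. The value of `eps` and the helper names `eps_inside` and `eps_outside` are assumptions for illustration; it only compares the denominators and says nothing about which form trains better.

```python
import math

eps = 1e-5  # illustrative value, not a prescribed one


def eps_inside(var):
    """Denominator of the form sqrt(var + eps)."""
    return math.sqrt(var + eps)


def eps_outside(var):
    """Denominator of the form sqrt(var) + eps."""
    return math.sqrt(var) + eps


# Compare the two forms for a zero, a tiny, and a unit variance.
for var in (0.0, 1e-6, 1.0):
    print(f"var={var:g}  inside={eps_inside(var):.6g}  outside={eps_outside(var):.6g}")
```

The two forms are nearly identical when the variance is much larger than `eps`, and differ most when the variance is close to zero, where `sqrt(eps)` is much larger than `eps` itself.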
