Batch Normalization

I’m a bit confused by the notation in this slide. I had assumed that the normalization would be carried out separately for each training example (so m means and m variances), but the summation index i seems to run over the training examples. Do we calculate the mean and variance across all training examples together and use those values?

You are trying to normalize each feature independently. A single example gives you only one observation per feature, which is not enough to estimate a mean or a variance for that feature. So yes, during training you compute each feature’s mean and variance over the whole mini-batch (and maintain moving averages of them for use at inference).
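To make the axis of the summation concrete, here is a minimal NumPy sketch of the idea, not the slide’s exact formulation. The function names, the `momentum` convention for the moving averages, and the default `eps` are my own assumptions; frameworks differ on these details.

```python
import numpy as np

def batch_norm_train(x, gamma, beta, running_mean, running_var,
                     momentum=0.9, eps=1e-5):
    """Normalize a mini-batch x of shape (m, n_features), feature by feature."""
    # Statistics are taken over the batch axis (axis 0): one mean and one
    # variance per feature, not one per training example.
    mu = x.mean(axis=0)
    var = x.var(axis=0)  # biased estimate (divides by m)
    x_hat = (x - mu) / np.sqrt(var + eps)
    # Moving averages kept for inference (assumed update convention).
    new_running_mean = momentum * running_mean + (1 - momentum) * mu
    new_running_var = momentum * running_var + (1 - momentum) * var
    return gamma * x_hat + beta, new_running_mean, new_running_var

def batch_norm_infer(x, gamma, beta, running_mean, running_var, eps=1e-5):
    """At inference, use the stored moving averages instead of batch statistics."""
    x_hat = (x - running_mean) / np.sqrt(running_var + eps)
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(64, 4))  # m = 64 examples, 4 features
gamma, beta = np.ones(4), np.zeros(4)
y, rm, rv = batch_norm_train(x, gamma, beta, np.zeros(4), np.ones(4))
print(y.mean(axis=0))  # close to 0 for every feature
print(y.std(axis=0))   # close to 1 for every feature
```

Note that `batch_norm_infer` works even on a single example, which is exactly why the moving averages are kept: at inference there may be no batch to compute statistics from.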

Is there a difference between using $\sqrt{\sigma^2 + \epsilon}$ and $\sqrt{\sigma^2} + \epsilon$, similar to the $\epsilon$ term in the Adam optimizer?

Both avoid division by zero, but I don’t know whether one of the approaches works better in this particular context. Whichever form you pick, it is important to use it consistently.
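For what it’s worth, the two forms can be compared numerically. The sketch below does not settle which one trains better; it only shows where they differ. The value of `eps` and the helper names `inside` and `outside` are illustrative assumptions.

```python
import math

eps = 1e-5  # illustrative value, not taken from the slide

def inside(var, eps=eps):
    # sqrt(sigma^2 + eps)
    return math.sqrt(var + eps)

def outside(var, eps=eps):
    # sqrt(sigma^2) + eps
    return math.sqrt(var) + eps

# Compare the two denominators for a zero, a tiny, and a typical variance.
for var in (0.0, 1e-5, 1.0):
    print(f"var={var:g}  inside={inside(var):.6g}  outside={outside(var):.6g}")
```

When the variance is near zero, the denominator is about $\sqrt{\epsilon}$ in the first form and about $\epsilon$ in the second, so the second amplifies small deviations much more. For variances far above $\epsilon$ the two are practically identical. One further difference: the derivative of $\sqrt{\sigma^2 + \epsilon}$ with respect to $\sigma^2$ stays finite at $\sigma^2 = 0$, whereas that of $\sqrt{\sigma^2} + \epsilon$ does not, which can matter when gradients flow through the variance.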

