I’m a bit confused by the notation in this slide. I thought the normalization would be carried out separately for each training example (so m examples would mean m means and m variances). But the summation variable i seems to run over the training examples. Do we compute the mean and variance across all training examples together and use those values?
Hi, @bgoyal.
You are trying to normalize each feature independently, and a single example gives you only a single observation of each feature, so there is nothing to average per example. So yes, during training you compute the means and variances over the whole mini-batch (and maintain moving averages of them for use at inference).
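In case a small sketch helps, this is how I'd picture it in NumPy. The array `X` and its values are made up purely for illustration; the point is that the statistics are taken along the example axis, giving one mean and one variance per feature:

```python
import numpy as np

# Hypothetical mini-batch: m = 4 examples, 3 features.
X = np.array([[1.0, 2.0, 0.5],
              [0.9, 1.8, 0.7],
              [1.1, 2.2, 0.4],
              [1.0, 2.0, 0.6]])

eps = 1e-5

# One mean and one variance PER FEATURE, each computed across
# the m examples in the mini-batch (axis=0), not per example.
mu = X.mean(axis=0)   # shape (3,)
var = X.var(axis=0)   # shape (3,)

X_norm = (X - mu) / np.sqrt(var + eps)
print(mu.shape, var.shape)  # (3,) (3,) -- one statistic per feature
```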
Here are all the details, in case you’re interested.
Is there a difference between using $\sqrt{\sigma^2 + \epsilon}$ vs. $\sqrt{\sigma^2} + \epsilon$ similar to the Adam optimizer?
Hi, @AbhijeetKrishnan.
Both avoid division by zero, but I don’t know whether one approach works better than the other in this particular context. Consistency is definitely important, though: whichever form you pick, use the same one everywhere (in particular at both training and inference time).
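For what it’s worth, the two forms can diverge noticeably when the variance is tiny, which is exactly when epsilon matters. A quick numerical check (the values here are arbitrary, just chosen to make the effect visible):

```python
import numpy as np

eps = 1e-8
sigma2 = 1e-12  # a very small variance, where the choice matters most

a = np.sqrt(sigma2 + eps)  # ~1.0e-4: epsilon dominates under the root
b = np.sqrt(sigma2) + eps  # ~1.0e-6: epsilon is nearly negligible

print(a, b)  # the two denominators differ by about two orders of magnitude
```

So with $\sqrt{\sigma^2 + \epsilon}$, epsilon caps how large the normalized values can get more aggressively than $\sqrt{\sigma^2} + \epsilon$ does; whether that matters in practice here, I can’t say.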