Calculation of the Mean (mu) in Batch Norm

I am confused as to the calculation of mu (the mean) in the Batch Normalization for hidden layers in a Neural Network.
First, for the purposes of the question, let’s assume our minibatch size is 64, and the number of activation nodes in our given layer l is 5.
So, in a vectorized implementation, Z[l] has shape (5, 64). We thus need to run it through batch normalization before applying the activation.
For the calculation of mu, over what dimension in Z[l] are we taking the average? Meaning, are we averaging across the 5 activations (the rows), or are we averaging across the 64 examples (the columns)? What would the resulting shape of mu be?

Thank you, to anyone who helps!

Hi Steven_Zayas,

Thanks for your great question about this issue, which confused me too.

Reading this post made it clear that the averaging is in fact done over the examples (the columns). This makes sense if you think about it: you want to standardize the level of activation per node, so each node's mean is computed across all 64 examples in the minibatch. The resulting shape of mu will be (5, 1).
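
To make the shapes concrete, here is a minimal NumPy sketch of just the normalization step (the variable names `Z`, `mu`, `var` are my own; the learnable scale/shift parameters gamma and beta are omitted for brevity):

```python
import numpy as np

np.random.seed(0)
Z = np.random.randn(5, 64)  # pre-activations: 5 nodes, 64 examples

# Average over the examples (axis=1, the columns), keeping dims
mu = np.mean(Z, axis=1, keepdims=True)   # shape (5, 1)
var = np.var(Z, axis=1, keepdims=True)   # shape (5, 1)

eps = 1e-8  # small constant for numerical stability
Z_norm = (Z - mu) / np.sqrt(var + eps)   # broadcasts (5, 1) across the 64 columns

print(mu.shape)       # (5, 1)
print(Z_norm.shape)   # (5, 64)
```

Note the `keepdims=True`: it keeps `mu` as a (5, 1) column vector rather than a flat (5,) array, so the subtraction broadcasts correctly across the 64 examples. Each row of `Z_norm` (i.e., each node) then has approximately zero mean and unit variance across the minibatch.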