I am confused about the calculation of *mu* (the mean) in Batch Normalization for the hidden layers of a Neural Network.

First, for the purposes of the question, let’s assume our minibatch size is 64, and the number of activation nodes in our given layer *l* is 5.

So, in a vectorized implementation, Z[*l*] is of shape (5x64). We thus need to run this through our batch normalization before activation.

For the calculation of *mu*, over what dimension of Z[*l*] are we taking the average? Meaning, are we averaging across the 5 activations (the rows), or across the 64 examples (the columns)? And what would the resulting shape of *mu* be in each case?
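To make the two options concrete, here is a minimal NumPy sketch of what I mean (the names `mu_across_nodes` and `mu_across_examples` are just my labels for the two candidates, not anything from a course or library):

```python
import numpy as np

# Setup matching the question: 5 activation nodes, minibatch of 64 examples
Z = np.random.randn(5, 64)  # Z[l], shape (nodes, examples)

# Option A: average across the 5 activations (down each column)
mu_across_nodes = np.mean(Z, axis=0)  # shape (64,) -- one mean per example

# Option B: average across the 64 examples (along each row)
mu_across_examples = np.mean(Z, axis=1, keepdims=True)  # shape (5, 1) -- one mean per node

print(mu_across_nodes.shape)     # (64,)
print(mu_across_examples.shape)  # (5, 1)
```

So my question is which of these two axes the batch-norm *mu* is taken over.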

Thank you, to anyone who helps!