I am confused about how mu (the mean) is calculated in Batch Normalization for the hidden layers of a neural network.
First, for the purposes of the question, let's assume our minibatch size is 64 and that our given layer l has 5 activation nodes.
So, in a vectorized implementation, Z[l] has shape (5, 64). We then need to run this through batch normalization before applying the activation function.
For the calculation of mu, over what dimension in Z[l] are we taking the average? Meaning, are we averaging across the 5 activations (the rows), or are we averaging across the 64 examples (the columns)? What would the resulting shape of mu be?
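To make the two options concrete, here is a minimal NumPy sketch (the array `Z` is just a stand-in for Z[l] with the shapes from my example):

```python
import numpy as np

# Stand-in for Z[l]: 5 activation nodes x 64 minibatch examples
Z = np.random.randn(5, 64)

# Option A: average across the 64 examples (the columns),
# giving one mean per activation node
mu_per_node = np.mean(Z, axis=1, keepdims=True)
print(mu_per_node.shape)  # (5, 1)

# Option B: average across the 5 activations (the rows),
# giving one mean per example
mu_per_example = np.mean(Z, axis=0, keepdims=True)
print(mu_per_example.shape)  # (1, 64)
```

So the question is really: which of these two shapes should mu have?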
Thank you, to anyone who helps!