In the happyModel, the code shows that there are 128 parameters for the batch normalization step. Shouldn’t it be 64 instead, i.e. 32 betas and 32 gammas, since we use 32 filters in Conv2D?
Great question!
The following illustration is borrowed from Group Normalization by Yuxin Wu and Kaiming He:
From Prof Andrew Ng’s lecture, we know that batch normalization learns a scale (gamma) and a shift (beta) for each normalized feature. As Yuxin Wu and Kaiming He’s illustration shows, axis=-1 means that we calculate the mean and standard deviation over all pixel values (width x height) across the entire batch, separately for each channel.
In the Happy Model example, we have 32 channels, so we get 32 gammas and 32 betas. We also calculate 32 means and 32 standard deviations.
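To make this concrete, here is a small NumPy sketch of the per-channel statistics (the batch size and image dimensions are illustrative, not taken from the assignment):

```python
import numpy as np

# Illustrative NHWC batch: 4 images, 64x64 pixels, 32 channels.
x = np.random.randn(4, 64, 64, 32)

# axis=-1 normalization: average over batch, height, and width,
# leaving one mean and one standard deviation per channel.
mean = x.mean(axis=(0, 1, 2))
std = x.std(axis=(0, 1, 2))

print(mean.shape, std.shape)  # (32,) (32,)

# Normalize, then scale and shift with per-channel gamma and beta.
gamma = np.ones(32)   # 32 trainable scales
beta = np.zeros(32)   # 32 trainable shifts
x_norm = gamma * (x - mean) / (std + 1e-5) + beta
```

Note that `mean` and `std` each have exactly 32 entries, one per channel, which matches the count of gammas and betas.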
Running the example isolated in TensorFlow:
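A minimal sketch of an isolated `BatchNormalization` layer (the 64x64 spatial size is an assumption; only the 32 channels matter for the parameter count):

```python
import tensorflow as tf

# Hypothetical input shape: 64x64 feature maps with 32 channels,
# matching the 32 Conv2D filters.
bn = tf.keras.layers.BatchNormalization(axis=-1)
bn.build((None, 64, 64, 32))

# Trainable: gamma (32,) and beta (32,)
trainable = sum(int(tf.size(w)) for w in bn.trainable_weights)
# Non-trainable: moving_mean (32,) and moving_variance (32,)
non_trainable = sum(int(tf.size(w)) for w in bn.non_trainable_weights)

print(trainable, non_trainable)  # 64 64
```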
Consequently, the 64 trainable params are the 32 gammas and 32 betas, whereas the 64 non-trainable params are the 32 moving means and 32 moving variances that the layer tracks for inference. In total, the batch normalization layer has 128 parameters.
Great thank you! This is a very clear explanation.