Implementing Batch Norm - Why not merge W and Gamma into W tilde?

Hello,

You may have had the same intuition as me, so I’d like to share this insight.

Since we got rid of b by using \beta (the mean subtraction cancels any constant bias anyway), I thought that when implementing Batch Norm we could save memory and computation by merging W and \gamma into \tilde{W} this way:

\tilde{W} = \gamma \odot W (element-wise, broadcasting \gamma over the rows of W), or equivalently \tilde{W} = diag(\gamma) \times W
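
To make the merge concrete, here is a minimal NumPy sketch of the two equivalent ways of forming \tilde{W}, assuming the course shapes (W of shape (n^[l], n^[l-1]) and \gamma of shape (n^[l], 1)); the sizes and variable names are just illustrative:

```python
import numpy as np

# Illustrative shapes only: W is (n_l, n_prev), gamma is (n_l, 1), as in the course notation.
n_l, n_prev = 4, 3
rng = np.random.default_rng(0)
W = rng.standard_normal((n_l, n_prev))
gamma = rng.standard_normal((n_l, 1))

# Element-wise broadcast: row i of W is scaled by gamma_i.
W_tilde_broadcast = gamma * W

# Same thing written as a matrix product with diag(gamma) on the left.
W_tilde_diag = np.diag(gamma.ravel()) @ W

assert np.allclose(W_tilde_broadcast, W_tilde_diag)

# The merged layer would then compute Z = W_tilde @ A_prev, normalize Z over the
# mini-batch, and add beta (no separate gamma step and no bias b).
```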

Looking back at the course, I came to the conclusion that this could spoil the regularizing and exploration capabilities of Batch Norm, for only a minimal theoretical performance gain.

What do you think of it?

BR

Course slide:

C2_W3_Implementing_Batch_Norm

I have not thought about the mathematics of Batch Norm any more than just listening to what Prof Ng has said in the lectures. But when you apply it, there is another level of subtlety: at least as I understand the Keras APIs, you can choose whether the batch norm logic runs in “training” mode independently of whether you are actually training your model parameters. In other words, there may be another reason not to “fuse” the W and \gamma values: even if you are running your model in inference mode with pretrained W values, you have the choice to let the BN parameters stay dynamic, based on the current dataset. I don’t know how much this is actually used in practice, but here is an article and tutorial from François Chollet about how to use the “training” attribute and transfer learning in general.
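
Here is a rough sketch of what I mean, in the spirit of that transfer learning guide (the particular backbone, input shape, and output head are just placeholders, not something from the course):

```python
from tensorflow import keras

# Rough sketch: a pretrained backbone with frozen weights, where the
# BatchNormalization layers inside it are forced into inference mode.
base_model = keras.applications.MobileNetV2(
    input_shape=(160, 160, 3), include_top=False, weights="imagenet"
)
base_model.trainable = False  # freezes W, gamma, beta of the backbone

inputs = keras.Input(shape=(160, 160, 3))
# training=False makes the BN layers use their stored moving mean/variance,
# even if we later set base_model.trainable = True to fine-tune.
x = base_model(inputs, training=False)
x = keras.layers.GlobalAveragePooling2D()(x)
outputs = keras.layers.Dense(1)(x)
model = keras.Model(inputs, outputs)
```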

The other place to look to think further about this would be the original Batch Norm paper: see whether the authors comment on anything relevant to your idea.

Thanks @paulinpaloalto,

I will read the papers and come back to you.