What is meant by "When freezing layers avoid keeping track of statistics (like in the batch normalization layer)"?

Could someone please help me understand this sentence? (from W2 mobilenet assignment)

I understand that one might want to freeze some layers to only train specific layers or to use the model for inference (i.e. by freezing all layers).

I do not understand two aspects of this sentence:

  1. Why batch normalization is specifically mentioned (I seem to recall that something about freezing batch normalization was briefly mentioned in the videos, but I could not find it again).
  2. How one avoids keeping track of statistics while freezing layers, i.e. ad absurdum, could someone actually keep track of statistics while freezing layers?

Thank you for your help!
Daniele


Hi, Daniele.

The key point to realize here is that Batch Norm is also trainable, but that is separate from whether the usual “parameters” of the various layers are trainable. So what they are saying is to make sure to do two things:

  1. Freeze the parameters of the layer (weight and bias values).
  2. Disable the training of the batch normalization constants as well (those are the “statistics” they are referring to, since BN is based on the mean and variance of the data in each minibatch).

The mechanisms you use to control those two types of training are different. All this is not very thoroughly explained in the course materials, but the Keras documentation is pretty complete if you can find the relevant sections. Here’s a thread that has pointers into the Keras documentation about Transfer Learning.
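To make the two mechanisms concrete, here is a minimal sketch of the usual Keras transfer-learning pattern (MobileNetV2 and the 160x160 input shape are just assumptions for illustration): setting trainable = False freezes the weights, including BN's gamma and beta, while passing training=False when calling the base model keeps the BatchNormalization layers in inference mode so their moving statistics are not updated.

import tensorflow as tf

# Pretrained base model (MobileNetV2 is an assumption here, in the spirit of the assignment)
base_model = tf.keras.applications.MobileNetV2(
    input_shape=(160, 160, 3), include_top=False, weights='imagenet')

# 1. Freeze the parameters (weights/biases, and BN gamma/beta)
base_model.trainable = False

inputs = tf.keras.Input(shape=(160, 160, 3))
# 2. Call the frozen base with training=False so BatchNormalization runs in
#    inference mode and does not update its moving mean/variance
x = base_model(inputs, training=False)
x = tf.keras.layers.GlobalAveragePooling2D()(x)
outputs = tf.keras.layers.Dense(1)(x)
model = tf.keras.Model(inputs, outputs)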

Hi Paul,

Thank you for your answer.

So should the sentence read the same as “remember to freeze BatchNormalization layers too”?

Going through the Keras documentation for BatchNormalization helped me understand batch normalization better.

However, I think this is a case where the Keras documentation actually helps more by explaining what batch normalisation is than by explaining how Keras implements it, and something like that could be done in the course (maybe as a reading?).
i.e. having read the documentation, I take the "statistics" in the sentence I referenced to be gamma and beta.

These are the weights for BatchNormalization (from my Keras installation):

<tf.Variable 'gamma:0' shape=(4,) dtype=float32, numpy=array([1., 1., 1., 1.], dtype=float32)>
<tf.Variable 'beta:0' shape=(4,) dtype=float32, numpy=array([0., 0., 0., 0.], dtype=float32)>
<tf.Variable 'moving_mean:0' shape=(4,) dtype=float32, numpy=array([0., 0., 0., 0.], dtype=float32)>
<tf.Variable 'moving_variance:0' shape=(4,) dtype=float32, numpy=array([1., 1., 1., 1.], dtype=float32)>
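(For anyone who wants to reproduce that dump, a small sketch with a hypothetical 4-channel layer: Keras puts gamma and beta in trainable_weights and moving_mean / moving_variance in non_trainable_weights.)

import tensorflow as tf

bn = tf.keras.layers.BatchNormalization()
bn.build(input_shape=(None, 4))   # 4 channels, matching the shapes above

print([v.name for v in bn.trainable_weights])      # gamma, beta
print([v.name for v in bn.non_trainable_weights])  # moving_mean, moving_variance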

Excerpt from documentation:

During training (i.e. when using fit() or when calling the layer/model with the argument training=True), the layer normalizes its output using the mean and standard deviation of the current batch of inputs. That is to say, for each channel being normalized, the layer returns gamma * (batch - mean(batch)) / sqrt(var(batch) + epsilon) + beta, where:

epsilon is a small constant (configurable as part of the constructor arguments)
gamma is a learned scaling factor (initialized as 1), which can be disabled by passing scale=False to the constructor.
beta is a learned offset factor (initialized as 0), which can be disabled by passing center=False to the constructor.
During inference (i.e. when using evaluate() or predict(), or when calling the layer/model with the argument training=False, which is the default), the layer normalizes its output using a moving average of the mean and standard deviation of the batches it has seen during training. That is to say, it returns gamma * (batch - self.moving_mean) / sqrt(self.moving_var + epsilon) + beta.

self.moving_mean and self.moving_var are non-trainable variables that are updated each time the layer is called in training mode, as such:

moving_mean = moving_mean * momentum + mean(batch) * (1 - momentum)
moving_var = moving_var * momentum + var(batch) * (1 - momentum)
As such, the layer will only normalize its inputs during inference after having been trained on data that has similar statistics as the inference data.
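A quick sketch (my own check, not from the assignment) that makes the difference visible: calling the layer with training=True applies the moving-average update above, while training=False just uses the stored statistics and leaves them untouched.

import tensorflow as tf

bn = tf.keras.layers.BatchNormalization(momentum=0.9)
x = tf.random.normal((32, 4)) * 3.0 + 5.0   # batch with mean ~5

_ = bn(x, training=True)        # normalizes with batch stats AND updates the moving averages
print(bn.moving_mean.numpy())   # nudged from 0 towards ~5 by a factor (1 - momentum)

_ = bn(x, training=False)       # normalizes with the stored moving averages, no update
print(bn.moving_mean.numpy())   # unchanged by this call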

Yes, maybe that would be a clearer way to say it.

Well, to go “full nerd” here, my interpretation is that the statistics are the moving_mean and moving_variance. The gamma and beta are learned parameters that are applied to the inputs after they have been normalized with those statistics.

That’s an important point to emphasize: BatchNorm can train the gamma and beta values even when the overall model is being run in “inference” mode (no training) if you set that training parameter to be True. In other words, the purpose of the training parameter is specifically to control whether the BN parameters are being recalculated on the fly. When it is False, BN uses the remembered values of gamma and beta that were trained the last time the training was run.

I think you are correct.

From the documentation it seems that those statistics are not updated when training=False anyway, so, from the standpoint of the sentence in the title, there is no difference between trainable and non-trainable variables in BatchNormalization.

I guess it (the sentence in the title) would make sense if the moving_mean and moving_variance needed a separate flag from gamma and beta, so you actually had to remember to freeze those.
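For what it's worth, the Keras documentation on BatchNormalization also describes a special case that seems to answer my own question: as of TensorFlow 2.0, setting trainable = False on a BatchNormalization layer makes it run in inference mode, so the moving statistics stop being updated along with gamma and beta, and no separate flag is needed. A rough check (my own sketch, not part of the assignment):

import tensorflow as tf

bn = tf.keras.layers.BatchNormalization()
bn.build(input_shape=(None, 4))   # creates gamma, beta, moving_mean, moving_variance
bn.trainable = False              # freeze the whole layer

x = tf.random.normal((32, 4)) * 3.0 + 5.0
before = bn.moving_mean.numpy().copy()
_ = bn(x, training=True)          # a frozen BN layer is forced into inference mode in TF 2.x
after = bn.moving_mean.numpy()
print((before == after).all())    # expected: True - the statistics were not updated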