Q1. What does “avoid keeping track of statistics” actually mean?
My understanding is that when we use BN, we calculate the mean and variance of each mini-batch and store those values in memory, so explicitly writing “x = base_model(x, training=False)” means we are not going to save those values.
Is that correct?
Q2. When we build the fine-tuning model from the previous “model2”, whose parameters we froze, we unfreeze some of the later layers. I think the default setting in the previous model is
“base_model.trainable = False”, “x = base_model(x, training=False)”.
And we unfreeze the later layers in the fine-tuning model with
“layer.trainable = True”,
so what about the other flag? (x = base_model(x, training=True)?)
I mean, when we train the fine-tuning model,
do we not care about the statistics of the activations?
Or does it only set training=True for the activations in the later unfrozen layers?
I think we have to store them in memory, because if we update the batch norm parameters in the later layers, the backpropagation through BN needs the “cache” of the computation done in the forward pass (including the “statistics”, of course).
(The activation I mean here has shape (m, d), where “m” is the number of training examples in the batch, which is effectively (m·h·w) in a CNN, and “d” is “C”, the number of channels.)
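The other thing that you need to be careful of here is that there are two different kinds of parameters that get updated during training:
- The normal weight and bias values.
- The mean and variance values that Batch Normalization tracks (these are updated as running averages during the forward pass, not by back prop).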
The model.trainable flag controls whether you are doing back prop and updates on the weight and bias values.
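To make the effect of the trainable flag concrete, here is a minimal sketch that inspects a single BatchNormalization layer. It assumes TensorFlow 2.x is installed. The 4-channel layer and the variable names are only for illustration and are not taken from the course notebook.

```python
import tensorflow as tf

# A standalone BatchNorm layer with 4 channels (an arbitrary choice for this sketch).
bn = tf.keras.layers.BatchNormalization()
bn.build((None, 4))

# gamma and beta are learned by back prop; the moving mean and variance are not.
n_trainable_before = len(bn.trainable_weights)
n_non_trainable_before = len(bn.non_trainable_weights)

# Freezing the layer moves gamma and beta into the non-trainable list as well.
bn.trainable = False
n_trainable_after = len(bn.trainable_weights)
n_non_trainable_after = len(bn.non_trainable_weights)

print(n_trainable_before, n_non_trainable_before)
print(n_trainable_after, n_non_trainable_after)
```

Note that the moving mean and variance are listed as non-trainable even before freezing, which is why they need a separate mechanism from back prop.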
The training flag is independent and controls whether BatchNorm normalizes with the mean and variance of the current mini-batch and updates its stored running values (training=True), or simply uses the stored values that were previously computed (training=False). Those are the “statistics” that the comment is talking about. Note that even when you are not training the network and only using it in “inference mode” to make predictions, you still have the option to enable the more dynamic behavior of BatchNorm. But doing that can cause some slightly odd results: the predictions you get on a given sample may vary a bit depending on what other samples you are bundling with it in a given inference call.
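The batch dependence described above can be seen in a small NumPy sketch. This is a simplified stand-in for BatchNorm, not the Keras implementation: the class TinyBatchNorm, its momentum value, and the sample data are all hypothetical, and the learnable scale and shift are left out.

```python
import numpy as np

class TinyBatchNorm:
    """Simplified batch norm over the batch axis of an (m, d) activation."""

    def __init__(self, d, momentum=0.9, eps=1e-5):
        self.moving_mean = np.zeros(d)
        self.moving_var = np.ones(d)
        self.momentum = momentum
        self.eps = eps

    def __call__(self, x, training):
        if training:
            # Use the statistics of this mini-batch and update the running averages.
            mean, var = x.mean(axis=0), x.var(axis=0)
            self.moving_mean = self.momentum * self.moving_mean + (1 - self.momentum) * mean
            self.moving_var = self.momentum * self.moving_var + (1 - self.momentum) * var
        else:
            # Use the stored statistics and leave them untouched.
            mean, var = self.moving_mean, self.moving_var
        return (x - mean) / np.sqrt(var + self.eps)

bn = TinyBatchNorm(d=2)
sample = np.array([[1.0, 2.0]])
batch_a = np.vstack([sample, [[3.0, 4.0]], [[5.0, 6.0]]])
batch_b = np.vstack([sample, [[10.0, -6.0]], [[4.0, 1.0]]])

# training=False: the result for `sample` does not depend on its companions.
same_in_inference = np.allclose(bn(batch_a, training=False)[0],
                                bn(batch_b, training=False)[0])

# training=True: the result for `sample` changes with the rest of the batch.
same_in_training = np.allclose(bn(batch_a, training=True)[0],
                               bn(batch_b, training=True)[0])

print(same_in_inference, same_in_training)
```

The same row gets identical outputs in both batches with training=False, but different outputs with training=True, because the normalization then uses each batch's own mean and variance.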
The general TF/Keras docs are not very clear on the true meaning of the various flags discussed above, but here’s a more detailed post from F. Chollet (the creator of Keras), which is basically a chapter of his book on Keras and discusses all this in a much clearer form. But be warned that it’s not a 2-minute read.
Thank you!
Now I see the difference between the model.trainable flag and the inference/training mode flag.
And batch norm is tied to both of those concepts!