In Exercise 2 - alpaca_model, I was a little bit confused about having these 2 lines in the same code fragment:
# freeze the base model by making it non trainable
base_model.trainable = False
# set training to False to avoid keeping track of statistics in the batch norm layer
x = base_model(x, training=False)
I believe I understand what they both are doing, but can someone point me to better documentation on ‘trainable model’ vs ‘training mode’?
I get that the first line freezes all the base_model parameters so they won’t get updated during training. The second line calls the base model in inference mode, which makes layers like Dropout or BatchNormalization behave differently (i.e., BatchNormalization uses its moving averages instead of the statistics of the current batch, and Dropout does not drop any units).
But I find it a bit confusing: I thought that just by having the first line one wouldn’t need the second one. In which scenarios might someone want to do something like this? For instance, setting the base_model to not trainable but still performing dropout in one of its layers doesn’t seem like a good approach to me (maybe it helps regularize the layers added on top?).
Thanks in advance
I totally agree that the documentation on all this is a bit confusing. I don’t claim to fully understand all this either, but hopefully this will help and not make things worse:
Note that the parameters of the model are a separate thing from Batch Normalization. If you read the TF docs for BatchNormalization, it says that when you run “fit()” it behaves differently than when you run inference, but in either case it also pays attention to the setting of the (BatchNorm specific) parameter training. So I think the point here is that we actually are running “fit()” on the entire alpaca_model, but we’ve set the base_model section to be not trainable. So we are in “fit()” mode, but the only parameters that will be learned are in the new layers that we added to specialize the network for alpaca recognition. In the case of BatchNorm, it has its own separate set of values that it either uses from what it previously learned in training (the mean and variance of the inputs) or it computes those values on the fly even in “inference” mode if you set training = True.
I guess in order to really understand all this, one approach would be to look at the TF source code and see if the BN logic is smart enough to look at the trainable attribute on a per layer basis, but it sounds like they are saying here that it’s not that clever. Somehow it only has two “flags” that control its behavior: the overall setting of “fit()” versus not and the “per layer” BatchNorm parameter training. At least that’s my interpretation. Which is probably worth what you paid for it. But I don’t have the time or energy to actually try looking at the TF code to understand this in more detail. If anyone else is motivated to do that, please share whatever you learn!
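To make the two “flags” concrete, here is a minimal pure-Python sketch of how I understand them to interact. ToyBatchNorm and its apply_gradients method are hypothetical names made up for this illustration; this is not the Keras implementation. The one rule borrowed from Keras is an assumption based on my reading of the TF 2.x BatchNormalization docs: a BN layer with trainable = False runs in inference mode even when it is called with training=True.

```python
class ToyBatchNorm:
    """Toy 1-D batch norm for illustration only (not the Keras layer)."""

    def __init__(self, momentum=0.9, eps=1e-3):
        self.gamma, self.beta = 1.0, 0.0               # learned weights
        self.moving_mean, self.moving_var = 0.0, 1.0   # running statistics
        self.momentum, self.eps = momentum, eps
        self.trainable = True

    def __call__(self, batch, training=False):
        # Assumption (my reading of the TF 2.x docs): a frozen BN layer
        # always behaves as in inference, whatever `training` says.
        use_batch_stats = training and self.trainable
        if use_batch_stats:
            mean = sum(batch) / len(batch)
            var = sum((v - mean) ** 2 for v in batch) / len(batch)
            m = self.momentum
            # Only this branch updates the running statistics.
            self.moving_mean = m * self.moving_mean + (1 - m) * mean
            self.moving_var = m * self.moving_var + (1 - m) * var
        else:
            mean, var = self.moving_mean, self.moving_var
        scale = (var + self.eps) ** 0.5
        return [self.gamma * (v - mean) / scale + self.beta for v in batch]

    def apply_gradients(self, d_gamma, d_beta, lr=0.1):
        # `trainable` decides whether an optimizer step may change the weights.
        if self.trainable:
            self.gamma -= lr * d_gamma
            self.beta -= lr * d_beta


batch = [2.0, 4.0, 6.0, 8.0]
bn = ToyBatchNorm()

bn(batch, training=True)    # batch statistics used, running statistics updated
bn(batch, training=False)   # running statistics used, nothing updated

bn.trainable = False
bn(batch, training=True)    # frozen: running statistics stay as they were
bn.apply_gradients(1.0, 1.0)  # frozen: gamma and beta stay as they were
```

If that reading is right, base_model.trainable = False already keeps the nested BatchNorm layers from updating their statistics. As far as I can tell from the Keras transfer learning guide, the reason to also pass training=False is fine-tuning: if you later set base_model.trainable = True, the BatchNorm layers still run in inference mode and their learned statistics are not overwritten.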