I totally agree that the documentation on all this is a bit confusing. I don’t claim to fully understand all this either, but hopefully this will help and not make things worse:
Note that whether the model’s parameters are trainable is a separate thing from how Batch Normalization behaves. If you read the TF docs for BatchNormalization, it says the layer behaves differently when you run “fit()” than when you run inference, but in either case it also pays attention to the setting of its (BatchNorm-specific) training argument. So I think the point here is that we actually are running “fit()” on the entire alpaca_model, but we’ve set the base_model section to be not trainable. So we are in “fit()” mode, but the only parameters that will be learned are in the new layers we added to specialize the network for alpaca recognition. In the case of BatchNorm, it also keeps its own separate set of values (the moving mean and variance of its inputs): it either uses the values it previously learned during training, or it computes them on the fly from the current batch if you pass training = True, even when the layer’s weights are frozen.
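Something like the following sketch is roughly the setup being discussed. The choice of MobileNetV2, the input shape, and the single-unit head are my own assumptions for illustration, not necessarily exactly what the assignment does; the point is just the interplay of trainable and training:

```python
import tensorflow as tf

IMG_SHAPE = (160, 160, 3)  # assumed input size for illustration

base_model = tf.keras.applications.MobileNetV2(
    input_shape=IMG_SHAPE, include_top=False, weights="imagenet")

# Freeze the pretrained section: none of its weights (including BatchNorm's
# gamma/beta and moving mean/variance) will be updated by fit().
base_model.trainable = False

inputs = tf.keras.Input(shape=IMG_SHAPE)
# training=False tells the BatchNorm layers inside base_model to use their
# stored moving mean/variance instead of the current batch's statistics,
# even while the surrounding model is being fit().
x = base_model(inputs, training=False)
x = tf.keras.layers.GlobalAveragePooling2D()(x)
outputs = tf.keras.layers.Dense(1)(x)  # new head: the only part that learns
alpaca_model = tf.keras.Model(inputs, outputs)

alpaca_model.compile(
    optimizer=tf.keras.optimizers.Adam(),
    loss=tf.keras.losses.BinaryCrossentropy(from_logits=True))
# alpaca_model.fit(train_dataset, epochs=5)  # only the new layers get trained
```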
I guess in order to really understand all this, one approach would be to look at the TF source code and see whether the BN logic is smart enough to check the trainable attribute on a per-layer basis, but it sounds like they are saying here that it’s not that clever: it only has two “flags” that control its behavior, the overall “fit()” versus inference mode and the per-layer BatchNorm training argument. At least that’s my interpretation, which is probably worth what you paid for it. I don’t have the time or energy to actually dig through the TF code to understand this in more detail, so if anyone else is motivated to do that, please share whatever you learn!
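For anyone who does want to poke at this without reading the source, a quick empirical probe (just a sketch I’m making up here, not part of the assignment) could look something like this:

```python
import tensorflow as tf

# Standalone BatchNorm layer and some data with a non-zero mean/variance.
bn = tf.keras.layers.BatchNormalization()
x = tf.random.normal((8, 4), mean=5.0, stddev=3.0)

bn(x, training=False)             # inference mode: normalizes with the stored
print(bn.moving_mean.numpy())     # moving statistics (still the initial zeros)

bn(x, training=True)              # training mode: normalizes with the batch
print(bn.moving_mean.numpy())     # statistics and updates the moving mean/variance

bn.trainable = False              # flip trainable off and call it again to see
bn(x, training=True)              # whether the moving statistics still get
print(bn.moving_mean.numpy())     # updated, i.e. whether trainable wins per layer
```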