When transfer learning, how do you tune hyperparameters?

How do you change the learning rate, for example? Do you change it for every layer, or leave it alone for the frozen ones?

When a layer is frozen, it learns nothing: its weights stay at whatever values they had before it was frozen.
When you unfreeze a layer, it learns as per the learning rate you specify when compiling the model.
Considering the MobileNet example, if you freeze all layers and build a model with a custom Dense layer of 1 neuron, only that final layer will learn weights.
As you unfreeze, starting from the deeper layers of the network, those unfrozen layers will also start adjusting their weights to better fit the training data.
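
Here is a minimal sketch of that flow in tf.keras. The input shape, the 1-neuron sigmoid head, the learning rates and the “last ~20 layers” cutoff are just illustrative assumptions:

```python
import tensorflow as tf

# Pre-trained MobileNet backbone, with its original classifier chopped off.
base = tf.keras.applications.MobileNet(
    input_shape=(224, 224, 3), include_top=False,
    pooling="avg", weights="imagenet")
base.trainable = False  # frozen: these weights keep their pre-trained values

model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dense(1, activation="sigmoid"),  # only this layer learns at first
])
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
              loss="binary_crossentropy", metrics=["accuracy"])

# Later: unfreeze the deepest layers so they also start adjusting their weights.
base.trainable = True
for layer in base.layers[:-20]:  # keep everything except the last ~20 layers frozen
    layer.trainable = False
# Recompile so the new trainable flags (and a smaller fine-tuning rate) take effect.
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5),
              loss="binary_crossentropy", metrics=["accuracy"])
```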

Thank you for your answer. But how do I change the hyperparameters when I am not satisfied with my accuracy?

See this page and start with grid search.
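
To make the grid search idea concrete, here is a hand-rolled sketch over two head hyperparameters. The grid values are arbitrary, and `x_train`/`y_train`/`x_val`/`y_val` stand in for your own data; tools like scikit-learn’s GridSearchCV or KerasTuner automate the same loop:

```python
import itertools
import tensorflow as tf

def build_model(learning_rate, dropout_rate):
    """Frozen MobileNet base plus a small trainable head; only the head's
    hyperparameters are searched, so the transferred weights stay valid."""
    base = tf.keras.applications.MobileNet(
        input_shape=(224, 224, 3), include_top=False,
        pooling="avg", weights="imagenet")
    base.trainable = False
    model = tf.keras.Sequential([
        base,
        tf.keras.layers.Dropout(dropout_rate),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate),
                  loss="binary_crossentropy", metrics=["accuracy"])
    return model

best_config, best_acc = None, 0.0
for lr, dr in itertools.product([1e-2, 1e-3, 1e-4], [0.0, 0.2, 0.5]):
    model = build_model(lr, dr)
    model.fit(x_train, y_train, epochs=5, verbose=0)   # your training split
    _, acc = model.evaluate(x_val, y_val, verbose=0)   # your validation split
    if acc > best_acc:
        best_config, best_acc = (lr, dr), acc
print("best (learning rate, dropout):", best_config, "val accuracy:", best_acc)
```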

Balaji has given you a good link to go deeper on that question. One high-level point also worth making explicitly is that you have to be a bit careful about what you change when you are doing Transfer Learning. For example, if you change the number of layers in the early part of the network, or anything else about the architecture of a given layer, you (by definition) lose any value of the transfer learning for that layer and all later layers of the network: you have to retrain them from scratch, so what is the real point of Transfer Learning in that case?

So if you want to preserve the value of the training inherent in the original model, you can only change things that don’t affect the architecture of the network up to the point where you “unfreeze” and add custom layers that specialize the solution to your particular case. In other words, only a subset of all “hyperparameters” can be treated as plastic in that context.

Thank you. So I can change every hyperparameter when transfer learning, as long as it doesn’t change the architecture? For example, if you overfit to the training set, can you simply add dropout to all layers?

Ahh, well, let’s think a little more carefully about this. That’s a pretty subtle question. Does dropout regularization at a given layer change the architecture? I would guess that it would be ok, but then the whole point is that you would be doing incremental training at that layer (and beyond), right? But I think the previously learned values of the parameters would still be valid as a starting point for that further training, as opposed to needing to be retrained from random initialization.
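
One quick way to see why the old parameters survive: a Dropout layer has no weights of its own and leaves tensor shapes unchanged, so a head with and without dropout carries exactly the same trainable parameters. A toy sketch (the 1024-wide input and the layer sizes are arbitrary):

```python
import tensorflow as tf

# Two heads sharing the same Dense layer objects; the second just adds Dropout.
d1 = tf.keras.layers.Dense(64, activation="relu")
d2 = tf.keras.layers.Dense(1, activation="sigmoid")

x1 = tf.keras.Input(shape=(1024,))
head_plain = tf.keras.Model(x1, d2(d1(x1)))

x2 = tf.keras.Input(shape=(1024,))
h = d1(x2)
h = tf.keras.layers.Dropout(0.3)(h)  # no weights, same output shape
head_dropout = tf.keras.Model(x2, d2(h))

# Identical weight lists: the previously learned values of d1 and d2 remain
# a valid starting point for the incremental training.
assert len(head_plain.weights) == len(head_dropout.weights) == 4
```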

Note that other forms of regularization like L2 are more clear cut in this respect: since L2 only adds a weight penalty term to the cost function, it does not change the forward computation at any layer, so it would not invalidate any of the previously trained weights and would just affect whatever incremental training you are applying from the “unfreeze” point forward.
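
So in a setup like the MobileNet one above, you could attach L2 to just the new head, e.g. (the penalty strength 1e-4 is an arbitrary illustration):

```python
import tensorflow as tf

base = tf.keras.applications.MobileNet(
    input_shape=(224, 224, 3), include_top=False,
    pooling="avg", weights="imagenet")
base.trainable = False  # pre-trained weights untouched

model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dense(
        1, activation="sigmoid",
        # Adds a lambda * ||W||^2 penalty to the loss only; the forward
        # pass, and hence the architecture, is unchanged.
        kernel_regularizer=tf.keras.regularizers.l2(1e-4)),
])
```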

But note that “per layer” properties other than the number of input and output neurons are also part of the architecture. E.g., you can’t change the activation function at a given layer without invalidating the previous training.

With a little more thought, maybe that last statement I made about changing activation functions could be elaborated a bit more:

Clearly if you change the activation function in a given layer, that means you are changing the architectural definition of that layer. So you clearly need to further train the weights in that layer and all subsequent layers of the network. But when you do that training, it is still an interesting question whether it would make sense to start from scratch (randomly initialize the weights again) or to start from whatever the pre-existing weights are. Maybe it’s valid to consider those just as reasonable a starting point as random weights. Maybe there is incremental value, or worst case the training will take just as long as it would have with random reinitialization. Just on general principles, I would guess that there is probably no universal “one size fits all” answer to a question like that. The answer is most likely “it depends”, which is equivalent to saying “I don’t know” :nerd_face: …
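
For what it’s worth, both starting points are cheap to set up in Keras, so you could even answer the “it depends” question empirically. A hypothetical sketch for a single Dense layer, re-sampling from the layer’s own initializers for the “from scratch” option:

```python
import tensorflow as tf

layer = tf.keras.layers.Dense(64, activation="relu")
layer.build(input_shape=(None, 1024))  # pretend these weights were pre-trained

# Option A: keep the existing weights and just continue training from them.

# Option B: start from scratch by re-sampling from the layer's initializers.
layer.set_weights([
    layer.kernel_initializer(shape=layer.kernel.shape).numpy(),
    layer.bias_initializer(shape=layer.bias.shape).numpy(),
])
```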

But the one thing that is clear is that changing the activation completely invalidates everything past that point, since the inputs to the next layer are completely different. So maybe the question of the starting point for training at that one layer is meaningless in the bigger picture.


That really helped me to understand transfer learning better. Thank you so much for your detailed and fast answers.