Just trying to get some better understanding here.
Andrew is reasoning about using a NN pretrained on a cat dataset to do transfer learning for a radiology dataset. I understand it's just an example, but it got me thinking about what exactly is being transferred. I guess it's about determining how much of the front part of the network you need, i.e. where the generic parts of image classification live. Is it mostly the network architecture that is valuable, rather than the pre-trained weights of each layer? Or a combination of both, I guess? Andrew talked about speed of training, so I suppose finding a better starting point than simply initializing randomly is favourable? Then why can't transfer learning be leveraged to some extent when training on the same dataset while searching for hyperparameters? Shouldn't it make sense to reuse the weights from a previous run to speed up learning there as well?
Reading about "the LR (learning rate) range test", I saw that they adopt a warm restart, which sounds like a version of "transfer learning" between iterations.
When you use “transfer learning”, it’s not just the architecture of the network: you actually use the learned weights. The fundamental idea is that the early layers of the network are learning to recognize more generic “features”, e.g. edges, curves and the like. Then the later layers can specialize into recognizing more specific things like a cat’s ear.
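To make that concrete, here is a minimal sketch of the usual setup in Keras (the choice of ResNet50 as the pretrained base, the image size and the layer sizes are just placeholders I picked for illustration, nothing specific to the lecture):

```python
import tensorflow as tf

# Load a base model with weights learned on a large generic dataset (ImageNet),
# dropping its original classification head.
base = tf.keras.applications.ResNet50(weights="imagenet", include_top=False,
                                      input_shape=(224, 224, 3), pooling="avg")

# Freeze the pretrained layers: we reuse their learned weights as-is.
base.trainable = False

# Add a new head that specializes the network to the new task,
# e.g. a binary "disease / no disease" classifier for radiology images.
model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

model.compile(optimizer=tf.keras.optimizers.Adam(1e-3),
              loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(new_task_dataset, ...)  # only the new head's weights get updated
```

Notice that both the architecture and the pretrained weights are reused; only the small new head gets trained on the new data.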
In the case of hyperparameter tuning, whether you can use the previously learned weights kind of depends on which hyperparameters you are talking about. If it’s the number of neurons output from any of the hidden layers or the activation functions in the hidden layers, then the previous weights are not very relevant, right?
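To see why, consider the shapes involved (a toy example with made-up sizes):

```python
import numpy as np

n_features = 100                    # inputs to a hidden layer
old_units, new_units = 64, 128      # hidden layer width before / after tuning

W_old = np.random.randn(old_units, n_features)   # learned in the previous run: (64, 100)

# After changing the "number of hidden units" hyperparameter to 128, the layer
# needs a weight matrix of shape (128, 100), so the old (64, 100) matrix
# simply doesn't fit and can't be reused directly.
print(W_old.shape, "vs required", (new_units, n_features))
```

For hyperparameters that don't change any shapes (learning rate, mini-batch size, regularization strength, number of epochs), warm-starting from a previous run's weights is at least mechanically possible, although whether that gives a fair comparison between hyperparameter settings is another question.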
Please, could someone help me clarify a few questions…
I get the point of taking advantage of the pre-trained earlier layers for their low-level features, but when considering a full update of the model that retrains all weights and biases (this possibility is mentioned in the class), what are the benefits? Is it to:
take advantage of the model architecture?
use the current weights and biases, instead of random values, as the starting point for optimizing the loss/cost?
When you do transfer learning, it is expected that if you add any additional layers to specialize the network to your particular data, then you have to do additional training at least of the new layers that you added. Then I think your question is (if I can rephrase a little just to make sure we are talking about the same thing) how do you decide whether you also need to do additional training on the earlier layers of the network?
I should start with a big disclaimer here: I have never actually tried to apply Transfer Learning on my own to a new problem, so all I know is based on what I've heard Prof Ng say in these courses. So please take the following ideas with an appropriate dosage of salt …
I think the fundamental question boils down to whether your particular input data differs in any significant way, in terms of low-level features, from the original training corpus used to create the model you are starting with. Normally one would assume that the point of starting with a pretrained model is that it was trained on a large and pretty general dataset, so it would usually not be necessary to do any additional training on the early layers. E.g. one assumes that the training corpus covers the same image types (RGB, PNG or …) and that it includes daylight, indoor, color, greyscale and black and white images (if your data includes b & w or greyscale) with the same general scale and format, and so forth.
But the top-level rule is always: do what works, right? If you try keeping the early layers frozen and training only your added layers, or the layers past some point, do you get accuracy results that are good enough for your purposes? If not, then one thing to consider might be additional training on the earlier layers of the network as well. Of course you would start with the pretrained weights for those layers rather than starting from scratch. If you start from scratch, then what is the real point, other than starting with an architecture that is known to work? In other words, just copying the network architecture and training completely from scratch is not technically "transfer learning". It's just copying another architecture … That's a perfectly valid and very common thing to do (as Prof Ng discussed back in Course 2), but not quite the same as "transfer learning".
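If it helps, here is roughly what that "frozen first, then maybe fine-tune" strategy looks like in Keras (again a hypothetical sketch; the model, learning rates and epoch counts are placeholders):

```python
import tensorflow as tf

base = tf.keras.applications.ResNet50(weights="imagenet", include_top=False,
                                      input_shape=(224, 224, 3), pooling="avg")
model = tf.keras.Sequential([base,
                             tf.keras.layers.Dense(1, activation="sigmoid")])

# Stage 1: keep the pretrained layers frozen, train only the new head.
base.trainable = False
model.compile(optimizer=tf.keras.optimizers.Adam(1e-3),
              loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(train_ds, validation_data=val_ds, epochs=10)

# Stage 2 (only if the frozen version isn't accurate enough): unfreeze the
# base and keep training *from the pretrained weights*, not from scratch,
# with a much smaller learning rate so the generic features aren't wrecked.
base.trainable = True
model.compile(optimizer=tf.keras.optimizers.Adam(1e-5),
              loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(train_ds, validation_data=val_ds, epochs=5)
```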
I have a similar question about the same lecture.
What if we want to change the architecture of the neural network, or what do we have to do if the input layers have different sizes when we do the transfer learning task (e.g. the cat recognition images are 128x128, while the x-ray images are 256x512)? In that case, do we just replace the input layer? And what if the size of the next hidden layer depends on the input size? I'm wondering whether we can adapt to these kinds of changes in transfer learning, or whether starting from scratch is the way to go.
If the dimensions or image type (RGB, CMYK, Greyscale …) of the input change, then all bets are off: you are starting from scratch again and Transfer Learning is not applicable. A model is trained on a particular input format. There is no way to change that without fundamentally changing the model: even the shapes of the weight matrices in the first layer are not the same, right? So there is no way to apply the original network to the new problem.
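Just to make the shape point concrete with made-up numbers:

```python
# If the first hidden layer were fully connected with, say, 50 units:
cat_features  = 128 * 128 * 3   # = 49_152  (128x128 RGB cat images)
xray_features = 256 * 512 * 1   # = 131_072 (256x512 greyscale x-rays)

W1_cat_shape  = (50, cat_features)    # (50, 49152)
W1_xray_shape = (50, xray_features)   # (50, 131072) -> incompatible

# Even with a convolutional first layer, the kernel shape depends on the
# number of input channels: e.g. (3, 3, 3, filters) for RGB vs
# (3, 3, 1, filters) for greyscale, so those learned kernels don't carry over.
print(W1_cat_shape, W1_xray_shape)
```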
One thing you could consider to get around this would be to see if there is some kind of image preprocessing step you could apply to convert your new inputs to the same format as the images on which the original network was trained. The fundamental issue remains: you can change the output layers of a network and still do Transfer Learning, but the early layers need to be the same to apply that strategy.
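For example, a very simple preprocessing step along those lines might look like this (a sketch only; whether resizing and replicating the grey channel preserves enough diagnostic information for x-rays is a question for your particular problem):

```python
import tensorflow as tf

def to_pretrained_format(xray, target_size=(224, 224)):
    """Convert a greyscale x-ray of shape (H, W, 1) to the RGB size that an
    ImageNet-style pretrained model expects. Purely illustrative."""
    img = tf.image.resize(xray, target_size)   # rescale the spatial dimensions
    img = tf.image.grayscale_to_rgb(img)       # replicate the single channel to 3
    return img

# Example: a fake 256 x 512 greyscale image -> (224, 224, 3)
fake_xray = tf.random.uniform((256, 512, 1))
print(to_pretrained_format(fake_xray).shape)
```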