I understand what transfer learning is and how it works. The point I'm missing is how a model pre-trained to classify cat images can be used to classify X-rays. The input data here is completely different, so I would guess it wouldn't be helpful. Another point is that the weight initialization is random in the first place. How does random weight initialization help? I thought that, being random, it would only be useful for the original input, not for a different input.
Can you be more specific about where the statements you are asking about come from? Is it something Prof Ng says in one of the lectures? If so, please give us the lecture and the time offset.
But just in general, here are some thoughts:
The random initialization of weights (parameters) only happens before any training at all has taken place. It is needed for "symmetry breaking", and it applies to the initial training of the pre-trained model that is the input to Transfer Learning. The whole point of Transfer Learning is that we start with a pre-trained model, so the "initialization" for the Transfer Learning phase (if it even makes sense to call it that) is to set the weights to the trained weights of that pre-trained model. There is nothing random about it. Then we may do additional training to specialize the general model to our specific dataset. That training may involve all the layers of the network, or just the layers past some particular point that we choose, including (optionally) additional layers that we add to the pre-existing architecture.
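Here is a minimal sketch of that idea, assuming a TensorFlow/Keras workflow (the specific model, MobileNetV2, and the ImageNet weights are just placeholder choices, not anything prescribed by the course):

```python
import tensorflow as tf

# The weights come from the earlier training on ImageNet; nothing here is random.
base_model = tf.keras.applications.MobileNetV2(
    input_shape=(224, 224, 3),
    include_top=True,
    weights="imagenet",
)

# From this starting point we can fine-tune every layer on our own data,
# or freeze some prefix of the network and train only what comes after it.
base_model.trainable = True    # train all layers, or ...
# base_model.trainable = False # ... freeze everything pre-trained
```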
I think you're right that it probably would not be useful to start with an image classifier trained to recognize cats in RGB images if your real goal is to analyze medical images of some sort. But there may be some value in starting with the first few layers of a general "object classifier" network, since the early layers of a network like that are trained to recognize low-level features of images (lines, curves, edges ...), which may be common to all images. The later layers that are specialized to recognize, say, a cat's ear are not useful.
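If it helps, here is a hedged sketch of reusing only the early layers as a low-level feature extractor (again assuming Keras; VGG16 and the layer name "block2_pool" are just illustrative choices):

```python
import tensorflow as tf

# Load a general object classifier; include_top=False drops its classification head.
full_model = tf.keras.applications.VGG16(
    weights="imagenet", include_top=False, input_shape=(224, 224, 3)
)

# Keep only the first two convolutional blocks, which respond to generic
# lines, edges and curves; discard the later, cat-specific layers.
early_features = tf.keras.Model(
    inputs=full_model.input,
    outputs=full_model.get_layer("block2_pool").output,
)
early_features.trainable = False  # reuse these low-level features as-is
```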
One other thing to note is that if we actually add new layers on the end of the pre-trained model, which we are going to train with our specific data, then we would use random initialization when we start training those new layers. Of course that applies only to the brand-new added layers: for the pre-trained layers, we do not "reinitialize". That's the point: we're taking advantage of all the training that has already been done.
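A sketch of that, again assuming Keras (the base model and the number of classes are hypothetical): the pre-trained layers keep their learned weights, and only the newly added layers get Keras' default random initialization.

```python
import tensorflow as tf

# include_top=False removes the pre-trained classifier (softmax) head.
base = tf.keras.applications.MobileNetV2(
    weights="imagenet", include_top=False, input_shape=(224, 224, 3)
)
base.trainable = False  # the pre-trained weights are kept exactly as they are

num_classes = 5  # hypothetical number of classes in our specific dataset
model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    # These layers are brand new, so they start from a fresh random initialization.
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(num_classes, activation="softmax"),
])

model.compile(optimizer="adam", loss="categorical_crossentropy")
# model.fit(...) would now update only the randomly initialized layers we added.
```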
Prof Ng does explicitly describe the scenario that I think you are asking about. One perfectly legitimate method is to remove the existing pretrained softmax layer and then add our own new softmax layer, or perhaps several additional new layers ending in our softmax. Then we would do additional training of just our newly added layers to specialize them to our dataset. If you missed this in the lectures, it would be a good idea to go back and watch them again. You can find the relevant places by using the "interactive transcript" feature: just search for some terms in the transcript and then click to start the lecture at that point.
Prof Ng also discusses the fact that, in the general case, we may need to "unfreeze" the existing network at some layer earlier than the new layers we add for our specific classification, and then do additional training of those later pre-existing layers as well as of our newly added layers. There is no single recipe that works in all cases: we have to run some experiments to see what the best solution is in any given particular case.
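Here is a rough sketch of that "unfreeze" idea, with the same Keras assumptions as above (the cut-off layer index and the new head are hypothetical; the right cut-off is exactly the kind of thing you find by experiment):

```python
import tensorflow as tf

base = tf.keras.applications.MobileNetV2(
    weights="imagenet", include_top=False, input_shape=(224, 224, 3)
)

# Unfreeze the last part of the pre-trained network; keep the earlier layers frozen.
unfreeze_from = 100  # hypothetical cut-off; found by experiment
base.trainable = True
for layer in base.layers[:unfreeze_from]:
    layer.trainable = False

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(5, activation="softmax"),  # hypothetical new head
])

# A low learning rate is typical when fine-tuning pre-trained layers.
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5),
    loss="categorical_crossentropy",
)
# model.fit(...) now trains the unfrozen pre-trained layers and the new head together.
```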