I was wondering: in the case of tasks A and B (A->B), why is transfer learning more useful only when A has a lot of data? Why can't we always use it, regardless of how much data task B has, as a better initialization of the parameters (given that A and B are related tasks)?
I know Prof. Ng said that it may not hurt to use transfer learning in this case, but why isn't it always useful?
As far as I understand, neural networks generally need a huge amount of data to work well, so if we only have a small amount of data for task B, we use the data from task A to get a better initialization of the layers before re-training on the B data.
@MustafaaShebl I might be a little careful with the use of the word ‘initialization’, because that term also plays an important role in how you start the net from scratch (e.g. Xavier/He initialization).
I think what it comes down to is our data distributions: if those are somewhat of a match, or at least close, we can in effect ‘bend’ the network.
It is also important to note that we are basically only adjusting things at the tail end of the network, just before the output, so we are sort of finessing things a bit.
However, if our two (A/B) distributions are really different, or the datasets are individually too small, obviously this won’t work.
There is little (or nothing) for one to ‘learn’ from the other.
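To make the distinction concrete, here is a minimal Keras sketch of the two different “starting points”: a random initialization scheme (He) versus reusing the weights learned on task A. The file name task_a_model.h5 and the layer sizes are made-up placeholders, just for illustration:

```python
import tensorflow as tf

# Starting point 1: a fresh network for task B, weights drawn from a random
# scheme (He initialization here). This is "initialization" in the Xavier/He sense.
model_b_scratch = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu",
                          kernel_initializer="he_normal", input_shape=(100,)),
    tf.keras.layers.Dense(10, activation="softmax"),
])

# Starting point 2: the same architecture, but the weights are copied from a
# model already trained on task A. This is the "transfer" starting point.
model_a = tf.keras.models.load_model("task_a_model.h5")   # made-up file name
model_b_transfer = tf.keras.models.clone_model(model_a)   # same architecture, fresh weights
model_b_transfer.set_weights(model_a.get_weights())       # copy task A's learned weights
# In practice you would usually also swap the final layer for task B's labels,
# then continue training on the (small) task B dataset.
```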
Yes, I agree with Anthony’s explanation here. The key point is the statistical relationship between the training data used to train model A and the training data you have for model B. If the training datasets for both are relatively small, then perhaps that increases the chances that the statistical correlations between the two datasets are weak. Of course it is also possible in theory that the dataset for A is very large and still not statistically related to the dataset for B. So maybe size is not really that useful as a decision criterion here: the real point is how statistically similar the two datasets are, independent of their relative sizes.

But it’s been several years since I actually listened to these lectures, and it’s always worth listening again carefully to what Prof Ng says. I may be forgetting some of the subtleties.
Also note, just as a general point, that you have quite a bit of flexibility in how you actually implement Transfer Learning in any given case. You will need to decide how much of the pretrained network to “unfreeze” and retrain with your dataset. In some cases you may find it more helpful to retrain all layers of the base network on the dataset for your actual problem. In other cases, if the A and B datasets are more strongly correlated, it may not be necessary to unfreeze the earlier layers when you train on the B dataset. More hyperparameter choices to be made.
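To make the freeze/unfreeze choice concrete, here is a minimal Keras sketch. MobileNetV2 with ImageNet weights stands in for the “task A” network; the class count, layer cutoff, and learning rate are placeholders rather than recommendations:

```python
import tensorflow as tf

# Pretrained "task A" network (ImageNet weights), without its classification head.
base = tf.keras.applications.MobileNetV2(input_shape=(160, 160, 3),
                                         include_top=False, weights="imagenet")

# Option 1: freeze the whole base and only train a new head for task B.
base.trainable = False

# Option 2 (if the A/B data are less similar): unfreeze the last few layers too.
# base.trainable = True
# for layer in base.layers[:-20]:      # keep the earlier layers frozen
#     layer.trainable = False

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(5, activation="softmax"),   # 5 task-B classes, made up
])
model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(task_b_dataset, epochs=10)   # task_b_dataset is your (small) B data
```

A common pattern is to start with everything frozen and only unfreeze more layers (with a small learning rate) if the frozen version underfits the task B data.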
@paulinpaloalto I would just add that this point is not, I think, stressed all that much here. But in my independent study of diffusion models I was like ‘Wait, I get it now’.
*Or, it is the difference between the forest and the trees.
@paulinpaloalto @Nevermnd
So what matters more is the distribution of the data, not the size; we are only relying on the probability that a larger dataset (for task A) is more likely to be correlated with the smaller dataset (for task B).
Can I think of statistical correlation in terms of what Prof. Ng mentioned:
“When transfer learning makes sense
• You have a lot more data for Task A than Task B.
• Task A and B have the same input x.
• Low level features from A could be helpful for learning B.”
That is, when these three conditions are satisfied, can we say there is a good chance that A and B are correlated?
Also, I thought of having a small amount of data for task A as overfitting to task A, so the network cannot extract general features of the task that would generalize to, or help with, task B. Is that wrong?
Thanks in advance.
@MustafaaShebl @paulinpaloalto I have to defer to Paul on this, because he is the resident ‘math whiz’.
When I was in school I was taught you have to hit about n = 32 before the sample mean is approximately normally distributed (roughly the usual n ≥ 30 rule of thumb from the Central Limit Theorem).
I have no idea if any such threshold applies to neural nets?
I too, look forward to Paul’s answer on this (because I don’t know).
*I can also add, so I don’t seem dumb: if you are working with much smaller dataset sizes, there are a lot of methods (SVM, SVD, k-NN, even regression) that probably work both faster and better; a quick sketch is below. Your neural net is like the extreme case: “I have tons of data and don’t know what to do with it.”
**I should also add, in the case of nets, I am sure there is no way the number can be that small.
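For what it’s worth, here is a quick scikit-learn sketch of that point: classical models do just fine on a tiny dataset. The iris dataset (150 samples), the RBF kernel, and k = 5 are arbitrary placeholders, not tuned choices:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Tiny dataset (150 samples): the regime where classical models shine.
X, y = load_iris(return_X_y=True)

for name, clf in [("SVM", SVC(kernel="rbf")),
                  ("k-NN", KNeighborsClassifier(n_neighbors=5))]:
    pipe = make_pipeline(StandardScaler(), clf)       # scale features, then classify
    scores = cross_val_score(pipe, X, y, cv=5)        # 5-fold cross-validation
    print(f"{name}: mean CV accuracy = {scores.mean():.3f}")
```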
I should just say, sorry, not to ‘put you on the spot’ here Paul, but you likely have better insight than I do (as I admitted, I don’t know), and I was curious about this and what your thinking might be too.
Best,
-A