During the error analysis videos, when a lot of synthetic data and only a small amount of target data are available, why is pre-training on the synthetic data and then fine-tuning on the target data not mentioned as a possible approach? It sounds like the same problem discussed in the transfer learning chapter.
Transfer learning works really well when there is significant overlap between the characteristics of the pre-training (source) data and those of the target data. If the synthetic data only partially covers the characteristics of the target data, pre-training on it may not be ideal. In such cases, methods such as data blending or augmentation may be more effective.
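As a rough illustration of the blending idea, here is a minimal Python sketch that mixes a small target set with a controlled fraction of synthetic examples before training. The function and parameter names are placeholders for illustration, not part of the course material.

```python
import random

def blend_datasets(target_data, synthetic_data, synthetic_ratio=0.5, seed=0):
    """Blend a small target dataset with sampled synthetic examples.

    synthetic_ratio is the fraction of the blended set that should come
    from synthetic data; all names here are illustrative assumptions.
    """
    rng = random.Random(seed)
    # How many synthetic examples are needed to hit the requested ratio.
    n_synthetic = int(len(target_data) * synthetic_ratio / (1 - synthetic_ratio))
    n_synthetic = min(n_synthetic, len(synthetic_data))
    blended = list(target_data) + rng.sample(list(synthetic_data), n_synthetic)
    rng.shuffle(blended)
    return blended
```

Training on the blended set keeps the target examples in every epoch, rather than relegating them to a short fine-tuning phase after a purely synthetic pre-training run.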
The objective of pre-training is to obtain “generic” parameters that capture generic features - for example, in LLM pre-training it is assumed that the data set provides enough information about the structure of language and grammar. Using synthetic data for this purpose may or may not work - teacher forcing is one of the main affected steps if the synthetic data contains a lot of “hallucinations”, because the hallucinated tokens become the very targets the model is trained to reproduce.
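To make the teacher-forcing point concrete, here is a minimal PyTorch-style sketch of the next-token loss used in LLM pre-training; `model` and the tensor shapes are assumptions for illustration only. Note that the (possibly hallucinated) synthetic tokens serve both as the conditioning context and as the targets.

```python
import torch
import torch.nn.functional as F

def teacher_forced_loss(model, token_ids):
    """Next-token cross-entropy with teacher forcing.

    `model` is assumed to be any causal LM mapping token ids [B, T]
    to logits [B, T, V]; the name and shapes are illustrative.
    The ground-truth tokens (synthetic ones included) are fed as
    inputs at each step and also used as the prediction targets.
    """
    inputs = token_ids[:, :-1]    # tokens the model conditions on
    targets = token_ids[:, 1:]    # the "true" next tokens it must imitate
    logits = model(inputs)        # [B, T-1, V]
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
    )
```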
There are no major concerns with fine-tuning on your target data set beyond its size and variety - if you train for too many epochs on a small data set, or on a data set with limited variation, the adapter layers are likely to memorize the target data.
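One common guard against that memorization is to hold out a small validation split and stop fine-tuning the adapters once validation loss stops improving. The sketch below assumes a Hugging-Face-style model whose forward pass returns a `.loss` and whose adapter parameters are the only ones left trainable; every name here is illustrative, not a prescribed recipe.

```python
import copy
import torch

def finetune_with_early_stopping(model, train_loader, val_loader,
                                 max_epochs=20, patience=2, lr=1e-4):
    """Fine-tune only the trainable (adapter) parameters, stopping when
    validation loss stops improving to limit memorization of a small set."""
    trainable = [p for p in model.parameters() if p.requires_grad]
    optimizer = torch.optim.AdamW(trainable, lr=lr)
    best_val, best_state, bad_epochs = float("inf"), None, 0

    for epoch in range(max_epochs):
        model.train()
        for batch in train_loader:
            optimizer.zero_grad()
            loss = model(**batch).loss   # assumes a forward pass that returns .loss
            loss.backward()
            optimizer.step()

        # Evaluate on the held-out split to detect overfitting/memorization.
        model.eval()
        with torch.no_grad():
            val_loss = sum(model(**b).loss.item() for b in val_loader) / len(val_loader)

        if val_loss < best_val:
            best_val = val_loss
            best_state = copy.deepcopy(model.state_dict())
            bad_epochs = 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break

    if best_state is not None:
        model.load_state_dict(best_state)
    return model
```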