Is pre-training for adaptation, as explained in Week 1 with BloombergGPT, different from fine-tuning?
BloombergGPT was trained on 49% public (non-financial) data and 51% finance-specific data (public + private). Given that, why is it called pre-training rather than fine-tuning?
We pre-train to produce a base model, and we then adapt that base model to a particular use case through fine-tuning. When we pre-train, we can start with no model at all. When we fine-tune, we start from a pre-trained base model.
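To make the difference in starting point concrete, here is a toy sketch in plain Python. It is not how an LLM is trained: the "model" is just a line y = w*x + b, and the names `init_weights`, `train`, `general_data` and `domain_data` are made up for illustration. The only point is that pre-training begins from random weights, while fine-tuning begins from the weights that pre-training produced.

```python
import random

def init_weights(seed=0):
    """Random starting point: this is where pre-training begins (no prior model)."""
    rng = random.Random(seed)
    return {"w": rng.uniform(-1, 1), "b": rng.uniform(-1, 1)}

def train(weights, data, lr=0.01, epochs=200):
    """Plain SGD on squared error for y = w*x + b. Returns a new weights dict."""
    w, b = weights["w"], weights["b"]
    for _ in range(epochs):
        for x, y in data:
            err = (w * x + b) - y
            w -= lr * err * x
            b -= lr * err
    return {"w": w, "b": b}

def loss(weights, data):
    """Mean squared error of the model on a dataset."""
    return sum((weights["w"] * x + weights["b"] - y) ** 2 for x, y in data) / len(data)

# Hypothetical datasets: a large "general" one and a small "domain" one.
general_data = [(x, 2.0 * x + 1.0) for x in range(-5, 6)]
domain_data = [(x, 2.2 * x + 0.8) for x in range(-2, 3)]

# Pre-training: start from randomly initialised weights.
base = train(init_weights(), general_data)

# Fine-tuning: start from the pre-trained base, train briefly on domain data.
tuned = train(dict(base), domain_data, epochs=20)
```

In this toy setup the fine-tuned weights fit the domain data better than the base weights do, even though the fine-tuning run is much shorter than the pre-training run.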
Yes, that’s how I understand it too. In the case of BloombergGPT, was there a foundation model that had already been pre-trained, or was a new model written from scratch?
Most probably BloombergGPT was written from scratch and pre-trained on the vast amount of information that Bloomberg has access to.
You see, the thing with these models is that they take very few lines of code compared with traditional software. The challenges with these models are:
The data. You need huge amounts of data to train a 50-billion-parameter model.
The compute resources to train them. You need lots and lots of GPUs to train a multi-billion-parameter model.
They used over 700 billion tokens for this training.
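To get a feel for the scale, here is a back-of-the-envelope sketch. It uses the widely quoted rule of thumb that training a dense transformer costs roughly 6 × parameters × tokens floating-point operations. The parameter and token counts are the round numbers from this thread. The GPU count and the sustained throughput per GPU are assumptions chosen purely for illustration, not figures reported here.

```python
def training_flops(n_params, n_tokens):
    """Approximate total training compute via the 6 * N * D rule of thumb."""
    return 6.0 * n_params * n_tokens

def training_days(total_flops, n_gpus, flops_per_gpu_per_sec):
    """Wall-clock days if every GPU sustains the given throughput the whole time."""
    seconds = total_flops / (n_gpus * flops_per_gpu_per_sec)
    return seconds / 86_400

n_params = 50e9    # 50-billion-parameter model (from the thread)
n_tokens = 700e9   # roughly 700 billion tokens (from the thread)

total = training_flops(n_params, n_tokens)

# Assumed hardware, for illustration only:
assumed_gpus = 512
assumed_sustained_flops = 100e12   # 100 TFLOP/s per GPU, sustained

days = training_days(total, assumed_gpus, assumed_sustained_flops)
print(f"total compute: {total:.2e} FLOPs")
print(f"estimated time: {days:.1f} days on {assumed_gpus} GPUs")
```

Under these assumptions the run takes weeks even on hundreds of GPUs, which is why data and compute, not the amount of code, are the real barriers.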