I understand that training LLMs involves pretraining followed by fine-tuning. Pretraining is conducted on a very large corpus of as much as hundreds of billions of tokens, while fine-tuning uses only a tiny fraction of that amount (e.g. maybe 1,000 examples). From what I understand, both are done in much the same way, through backpropagation. Why do the (say) 500,000 tokens from those 1,000 examples have a bigger effect than the last 500,000 tokens used in the pretraining phase?
Fine-tuning typically adds new layers onto an existing model to teach it a specific task.
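(For illustration only, a minimal PyTorch sketch of that style of fine-tuning, where the pretrained base is frozen and only a newly added head is trained; the checkpoint name and head size are placeholders, not anything from the question.)

```python
import torch
import torch.nn as nn
from transformers import AutoModel

# Hypothetical example: freeze a pretrained encoder, train only a new head.
base = AutoModel.from_pretrained("bert-base-uncased")  # placeholder checkpoint
for p in base.parameters():
    p.requires_grad = False  # pretrained weights stay fixed

head = nn.Linear(base.config.hidden_size, 2)  # new task-specific layer

# Only the new head's parameters receive gradient updates.
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)
```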
I am considering the case (in transformers) where the same architecture is used and all parameters are trained (full fine-tuning).
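For contrast, here is a rough sketch of that full-fine-tuning setup (assuming a Hugging Face causal LM and PyTorch; the checkpoint, learning rate, and example text are placeholders). Every pretrained weight stays trainable, and the update is ordinary backpropagation on the new examples with the same language-modeling loss used in pretraining:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical example: full fine-tuning, all parameters trainable.
model = AutoModelForCausalLM.from_pretrained("gpt2")  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)  # every weight updates

batch = tokenizer("Q: What is 2+2?\nA: 4", return_tensors="pt")
outputs = model(**batch, labels=batch["input_ids"])  # same LM loss as pretraining
outputs.loss.backward()
optimizer.step()
```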
I see.
Your private messages were helpful. I see now that the fine-tuning is not having a bigger effect; it is teaching the model to answer questions.