Can you mix and match different types of data?

I’m looking into fine-tuning an LLM for a domain-specific instruction fine-tuning project. However, there are a lot of related concepts that aren’t necessarily suited to a Q/A format, but that I believe the model would need in order to get a good general understanding of the topic.

My question is: does anyone know whether mixing prompt/answer training data with non-prompt/answer data, and also feeding the LLM reams of relevant background text/documents, would produce undesired results?

I’d like to know before I invest a fair bit of time/money into creating fine-tuning data. Any input would be greatly appreciated!

Yes, it might cause catastrophic forgetting of the original task, but there are techniques for overcoming that. You should check out the Generative AI course from AWS we have here to learn about some of those techniques.
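
One technique from that family is parameter-efficient fine-tuning, e.g. LoRA, which freezes the base weights and only trains small adapter matrices, so the original capabilities are largely preserved. Here is a minimal sketch, assuming the Hugging Face `transformers` and `peft` libraries; the model name and hyperparameters are placeholders, not recommendations:

```python
# Minimal LoRA fine-tuning sketch (illustrative only).
# Assumes Hugging Face transformers + peft; model name and
# hyperparameters below are placeholders, not recommendations.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base_model_name = "meta-llama/Llama-2-7b-hf"  # placeholder base model
tokenizer = AutoTokenizer.from_pretrained(base_model_name)
model = AutoModelForCausalLM.from_pretrained(base_model_name)

# LoRA leaves the original weights frozen and trains small adapters,
# which is one way to limit catastrophic forgetting.
lora_config = LoraConfig(
    r=16,                                  # rank of the adapter matrices
    lora_alpha=32,                         # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attention projections to adapt
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of the base model
```

You would then fine-tune the wrapped model on your mixed dataset with your usual training loop or `Trainer` setup.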

Thanks, I’m actually about two-thirds of the way through the AWS Generative AI course.
Sounds like the approach would be to present the background information as chunked Q/A pairs.
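
Something like this is what I have in mind, a rough sketch where the chunk size, the prompt wording, and the JSONL output format are all placeholders I’d tune later:

```python
# Rough sketch: turn background documents into chunked prompt/completion
# records. Chunk size, prompt template, and JSONL format are assumptions.
import json
from pathlib import Path

CHUNK_SIZE = 1000  # characters per chunk; placeholder value


def chunk_text(text: str, size: int = CHUNK_SIZE):
    """Yield fixed-size character chunks of a document."""
    for start in range(0, len(text), size):
        yield text[start:start + size]


def build_records(doc_dir: str, out_path: str, topic: str):
    """Write prompt/completion pairs, one JSON object per line."""
    with open(out_path, "w", encoding="utf-8") as out:
        for doc in Path(doc_dir).glob("*.txt"):
            for chunk in chunk_text(doc.read_text(encoding="utf-8")):
                record = {
                    "prompt": f"Explain the following about {topic}:",
                    "completion": chunk.strip(),
                }
                out.write(json.dumps(record) + "\n")


# Example usage with placeholder paths:
# build_records("background_docs", "background_qa.jsonl", "my domain")
```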
