We are trying to build a domain-specific LLM, similar to how an LLM was specialized for the protein domain in the ProLLaMA use case (https://arxiv.org/pdf/2112.08654).
We want to do continual pretraining of a base pretrained LLM on sequences that would occur in customer transactions, and then at a later stage do fine-tuning using instruction prompts.
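For context, this is roughly the setup I have in mind for the continual-pretraining stage (just a sketch assuming a HuggingFace-style causal LM; the model name, file name, and hyperparameters are placeholders I made up):

from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Placeholder checkpoint; any base causal LM would slot in here.
checkpoint = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

# LLaMA-style tokenizers have no pad token, which the collator needs.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# One transaction sequence per line in a plain-text file (placeholder path).
dataset = load_dataset("text", data_files={"train": "transactions.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

# mlm=False gives standard next-token (causal LM) training targets.
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="ckpt", num_train_epochs=1),
    train_dataset=tokenized["train"],
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()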
My questions are about this first step, where we need to produce the specialized pretrained model via continual learning.
How should we approach data preparation for this step? How should I set up the training data? The instruction prompts we would use for fine-tuning seem more language-based: we give an instruction together with an input and an output for the model to complete that instruction.
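For reference, the fine-tuning record I have in mind looks something like this (a made-up example using the common Alpaca-style instruction/input/output fields; the task wording is just an illustration, not from ProLLaMA itself):

example = {
    "instruction": "Predict the next merchant this customer is likely to transact with.",
    "input": "merchant a, merchant b, merchant c",
    "output": "merchant d",
}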
I’m confused between giving sequences as one line per customer:
customer1 - merchant a, merchant b, merchant c …
customer2 - merchant d, merchant c, merchant x
Or should it be more in line with how LLMs are trained on next-token prediction, with each prefix spelled out, like:
merchant a,
merchant a, merchant b
merchant a, merchant b, merchant c
merchant d,
merchant d, merchant c
…
Or is this not a correct approach at all?
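To make the two options concrete, here is roughly how I would serialize each customer's history into training text (a sketch; the merchant names are placeholders):

# Option 1: one training document per customer, the full sequence as-is.
def serialize_full(history):
    return ", ".join(history)

# Option 2: one training example per prefix, spelling out each next-token step.
def serialize_prefixes(history):
    return [", ".join(history[:i]) for i in range(1, len(history) + 1)]

history = ["merchant a", "merchant b", "merchant c"]
print(serialize_full(history))      # merchant a, merchant b, merchant c
print(serialize_prefixes(history))  # ['merchant a', 'merchant a, merchant b', ...]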
Please let me know if I need to give any other details on this.