I have a few questions:
-
Are we fine-tuning the base LM with the next-token prediction objective on the full instruction+input+answer sequence, or only on the answer? If it's the full sequence, then at inference time we don't always provide the instruction template, yet the model still generates outputs.
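Just to make concrete what I mean by the two options, here is a minimal sketch (assuming the Hugging Face transformers API; the prompt text is only a made-up example) of labels for the whole sequence versus labels that mask the prompt with -100 so the loss only covers the answer:

```python
from transformers import GPT2TokenizerFast

# Hypothetical example, only to illustrate the two labeling options I mean.
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

prompt = "### Instruction:\nName the capital of France.\n\n### Response:\n"
answer = "The capital of France is Paris."

prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
full_ids = tokenizer(prompt + answer, return_tensors="pt").input_ids

# Option A: compute the loss on the full sequence (labels == input_ids).
labels_full = full_ids.clone()

# Option B: compute the loss only on the answer (prompt positions set to -100,
# which the cross-entropy loss ignores).
labels_answer_only = full_ids.clone()
labels_answer_only[:, : prompt_ids.shape[1]] = -100
```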
-
Typically, how many instruction-answer pairs are required for good performance (for instance, for a gpt2 model)?
-
I have been working on fine-tuning GPT-2 variants (gpt2 and gpt2-large) for instruction following. I used the 52k-example Alpaca dataset and did the necessary preprocessing and data filtering. But what I found after training for a few epochs is that the model performs worse than the base model: it keeps generating series of commas or full stops. As far as I can tell, all the hyperparameters are set correctly.
The results were the same even with parameter-efficient fine-tuning.
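Roughly, my preprocessing builds prompts along these lines (a simplified sketch of an Alpaca-style template, not my exact code):

```python
# Simplified sketch of an Alpaca-style prompt template (not my exact preprocessing).
def format_example(example: dict) -> str:
    if example.get("input"):
        return (
            "Below is an instruction that describes a task, paired with an input "
            "that provides further context. Write a response that appropriately "
            "completes the request.\n\n"
            f"### Instruction:\n{example['instruction']}\n\n"
            f"### Input:\n{example['input']}\n\n"
            f"### Response:\n{example['output']}"
        )
    return (
        "Below is an instruction that describes a task. Write a response that "
        "appropriately completes the request.\n\n"
        f"### Instruction:\n{example['instruction']}\n\n"
        f"### Response:\n{example['output']}"
    )
```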
Here are the details of my training (also sketched as a Trainer config after the list):
num_epochs = 5
warmup_steps = 1000
learning_rate = tried 1e-3, 1e-5, and 5e-5, but the results were the same
weight_decay= 0.01
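Expressed as Hugging Face TrainingArguments, the configuration would look roughly like this (just to make the settings concrete; my actual training loop may differ, and output_dir here is a placeholder):

```python
from transformers import TrainingArguments

# Sketch of the configuration above; batch size and other unlisted
# settings are omitted / left at their defaults.
training_args = TrainingArguments(
    output_dir="gpt2-alpaca-finetune",  # placeholder path
    num_train_epochs=5,
    warmup_steps=1000,
    learning_rate=5e-5,  # also tried 1e-3 and 1e-5 with the same results
    weight_decay=0.01,
)
```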
Can anyone please help me with this?
Thanks in advance