Instruction Tuning LLMs

I have a few questions:

  1. Are we fine-tuning the base LM with the next-token-prediction objective over the full instruction+input+answer sequence, or only over the answer tokens? If it's the full sequence, then at inference time we don't always provide the instruction template, yet the model still generates reasonable outputs. Why does that work?

  2. Typically, how many instruction-answer pairs are required for good performance (for instance, with a GPT-2-sized model)?

  3. I have worked on fine-tuning GPT-2 variants (gpt2 and gpt2-large) for instruction following. I used the 52k-example Alpaca dataset and did the necessary preprocessing and data filtering. But after training for a few epochs, the model performs worse than the base model: it keeps generating strings of commas or full stops. As far as I can tell, I have set all the hyperparameters correctly.
     The results were the same even with parameter-efficient fine-tuning.
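To make question 1 concrete, here is a minimal sketch (plain Python, with made-up token IDs rather than a real tokenizer's output) of the two labeling strategies: scoring the next-token loss over the whole instruction+input+answer sequence versus masking the prompt so only the answer tokens contribute. The -100 ignore index follows the Hugging Face / PyTorch `CrossEntropyLoss(ignore_index=-100)` convention.

```python
IGNORE_INDEX = -100  # tokens with this label are skipped by the loss

def build_labels(prompt_ids, answer_ids, mask_prompt=True):
    """Return (input_ids, labels) for causal-LM fine-tuning.

    If mask_prompt is True, only the answer tokens are scored by the
    next-token loss; otherwise the full sequence is.
    """
    input_ids = prompt_ids + answer_ids
    if mask_prompt:
        labels = [IGNORE_INDEX] * len(prompt_ids) + answer_ids
    else:
        labels = list(input_ids)
    return input_ids, labels

# Example: 3 prompt tokens (instruction + input), 2 answer tokens.
inp, lbl = build_labels([11, 12, 13], [21, 22])
print(lbl)  # [-100, -100, -100, 21, 22]
```

Note that with mask_prompt=True the model still conditions on the prompt during training; it just isn't penalized for reproducing it.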

Here are the details about my training.
num_epochs = 5
warmup_steps = 1000
learning_rate = tried 1e-3, 1e-5, and 5e-5, but the results were the same
weight_decay= 0.01
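As a sanity check on the schedule above, here is some rough arithmetic relating warmup_steps to the total number of optimizer steps. The batch size is an assumption (it isn't stated above), so the numbers only illustrate the relationship, not my actual run.

```python
# Rough step-count arithmetic; batch_size is ASSUMED for illustration.
dataset_size = 52_000        # Alpaca examples (from the post)
batch_size = 8               # assumption, not stated in the post
num_epochs = 5               # from the post
warmup_steps = 1000          # from the post

steps_per_epoch = dataset_size // batch_size   # 6500 under this assumption
total_steps = steps_per_epoch * num_epochs     # 32500
print(warmup_steps / total_steps)              # ~0.03, i.e. ~3% of training spent warming up
```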

Can anyone please help me with this?

Thanks in advance