Task for Instruction Fine-Tuning

In the original FLAN paper, they used datasets for specific tasks to do the instruction fine-tuning, but then evaluated the resulting model on a different task not seen during training.

My understanding is that:

  1. They did this to get an unbiased evaluation of the zero-shot capabilities of the resulting model.
  2. For our practical use cases, it’s OK (recommended?) to fine-tune with a dataset corresponding to our task?

Is my understanding correct?
Thanks.

The task they evaluate the model on has to be of the same nature as the task it was fine-tuned on (what we call the same distribution), but not the same data.

You always need to fine-tune with a dataset corresponding to your task, but you perform the evaluation on unseen data from a similar distribution; otherwise, how could the model learn the task?
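For example, a minimal sketch with the Hugging Face `datasets` library (the file name is a hypothetical placeholder) of holding out unseen evaluation data from the same task distribution:

```python
from datasets import load_dataset

# Hypothetical instruction dataset for YOUR single task (e.g. summarization).
dataset = load_dataset("json", data_files="my_task_instructions.json")["train"]

# Hold out 10% for evaluation: unseen examples, but drawn from the
# same task / distribution as the training data.
splits = dataset.train_test_split(test_size=0.1, seed=42)
train_ds, eval_ds = splits["train"], splits["test"]
```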

I’m still confused; the paper seems to say something different:

@Yacine_Mazari

This is my understanding of this case:

The goal is to make the model more robust across multiple tasks, i.e., to generalize better, not just on a single task but on as many as possible.

For this, FLAN does instruction tuning using a dataset that combines different types of instructions: translation, summarization, Q&A, sentiment analysis, etc. They built a dataset that includes, in clusters, instructions from all these (and more) tasks.

At the end of the tuning, they expect the model to have become better at generalizing. For this reason, they want to test it on some tasks that were not seen during training.

Let’s suppose we have these tasks: translation, summarization, Q&A, sentiment analysis.
Let’s also suppose we build a dataset that includes instructions for the first three tasks in that list.
After the fine-tuning, we want to see if the model actually learned and became better at generalizing. So we run a test on sentiment analysis, which was not included in the fine-tuning. Hopefully we will see a good result.
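A minimal sketch of that setup (the file names here are hypothetical placeholders; the real FLAN mixture uses many datasets per cluster):

```python
from datasets import concatenate_datasets, load_dataset

# Hypothetical per-task instruction datasets, one per cluster.
clusters = {
    "translation":   load_dataset("json", data_files="translation.json")["train"],
    "summarization": load_dataset("json", data_files="summarization.json")["train"],
    "qa":            load_dataset("json", data_files="qa.json")["train"],
    "sentiment":     load_dataset("json", data_files="sentiment.json")["train"],
}

held_out = "sentiment"  # the task we evaluate zero-shot, never seen in training

# Instruction-tune on every cluster EXCEPT the held-out one...
train_mix = concatenate_datasets(
    [ds for name, ds in clusters.items() if name != held_out]
).shuffle(seed=42)

# ...and evaluate zero-shot on the unseen task.
eval_ds = clusters[held_out]
```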

The paper mentions that this works well in rather large models, while it can cause decreased performance in smaller models.

Now to your 2nd question:

For our practical use cases, where we (well, I) don’t have access to the computing power to fine-tune a model bigger than 20B parameters, and therefore have to deal only with smaller models (7B parameters or fewer), my recommendation is: fine-tune on a single task.

In fact, as the paper says and as we noted above, doing multi-task training on smaller models is counter-productive. So: let’s do fine-tuning with a dataset corresponding to our single task.
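A minimal sketch of that single-task setup with `transformers` (the model name, file name, and hyperparameters are illustrative placeholders, not a tuned recipe):

```python
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_name = "some-7b-model"  # placeholder: any causal LM of <= 7B params
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Some causal-LM tokenizers ship without a pad token.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Hypothetical single-task dataset: one task, one prompt format,
# e.g. {"text": "Summarize: <document>\n<summary>"}.
dataset = load_dataset("json", data_files="my_single_task.json")["train"]

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

dataset = dataset.map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="single-task-out",
                           num_train_epochs=3,
                           per_device_train_batch_size=4),
    train_dataset=dataset,
    # mlm=False -> labels are the input ids, i.e. standard causal-LM loss.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```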

Does it make sense?

Thoughts?


Or, in bigger models, instead of doing full fine-tuning one can use LoRA add-ons (trainable low-rank matrices added on top of the frozen model weights, instead of changing the weights themselves).
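For example, a minimal sketch with the `peft` library (the model name and target module names are illustrative; the right modules depend on the architecture):

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("some-large-model")  # placeholder

# LoRA injects small trainable low-rank matrices alongside the chosen
# layers; the original model weights stay frozen.
lora_config = LoraConfig(
    r=8,                                  # rank of the low-rank update
    lora_alpha=16,                        # scaling factor for the update
    target_modules=["q_proj", "v_proj"],  # illustrative; varies by model
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically a small fraction of all params
```

Since only the add-on matrices get gradients and optimizer state, this fits on much more modest hardware than full fine-tuning of the same model.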