@Yacine_Mazari
This is my understanding of this case:
The goal is to make the model more robust across multiple tasks, i.e. to generalize better, not just on a single task but on as many as possible.
For this, FLAN does instruction tuning on a dataset that combines different types of instructions: translation, summarization, Q&A, sentiment analysis, etc. They built a dataset that groups instructions from all these (and more) tasks into task clusters.
At the end of the tuning, they expect the model to have become better at generalizing. For this reason, they want to test it on some tasks that were not seen during training.
Let's suppose we have these tasks: translation, summarization, Q&A, sentiment analysis.
Let's also suppose we build a dataset that includes instructions for the first 3 tasks in the list above.
After the fine-tuning, we want to see whether the model actually became better at generalizing. So we test it on ‘sentiment analysis’, which was not included in the fine-tuning. Hopefully we will see a good result.
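To make that concrete, here is a toy sketch in Python of the held-out split. The cluster names and example strings are just mine for illustration, not the actual FLAN data:

```python
# Toy FLAN-style setup: tune on some task clusters, hold one out for evaluation.
# Task names and examples are made up for illustration.
task_clusters = {
    "translation":        [{"instruction": "Translate to French: Good morning.", "output": "Bonjour."}],
    "summarization":      [{"instruction": "Summarize: The meeting was moved to Friday because ...", "output": "Meeting moved to Friday."}],
    "qa":                 [{"instruction": "Q: What is the capital of Italy?", "output": "Rome."}],
    "sentiment_analysis": [{"instruction": "Is this review positive or negative? 'Loved it!'", "output": "positive"}],
}

held_out = "sentiment_analysis"  # never seen during instruction tuning

# Training mix: every example from every cluster except the held-out one.
train_set = [ex for task, exs in task_clusters.items() if task != held_out for ex in exs]
# Evaluation: the held-out cluster, used zero-shot to measure generalization.
eval_set = task_clusters[held_out]

print(f"train examples: {len(train_set)}, held-out ({held_out}) examples: {len(eval_set)}")
```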
The paper mentions that this works well for rather large models, while it can actually decrease held-out performance in smaller models.
Now to your 2nd question:
For our practical use cases, where we (well, I) don’t have the compute to fine-tune a model bigger than 20B params and have to work with smaller models, say 7B params or less, my recommendation is: fine-tune on a single task.
In fact, as the paper says and as noted above, doing multi-task instruction tuning on smaller models can be counter-productive. So: let's fine-tune with a dataset corresponding to our single task.
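If it helps, here is a minimal single-task fine-tuning sketch with Hugging Face transformers. The model name, hyperparameters, and the one summarization example are placeholders, not recommendations:

```python
# Minimal single-task (summarization) fine-tuning sketch with Hugging Face transformers.
# Model name, hyperparameters, and the single example below are placeholders only.
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

model_name = "EleutherAI/pythia-1b"  # any causal LM at or under ~7B params
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# One task only: summarization prompts paired with reference summaries.
examples = [
    {"text": "Summarize: The meeting was moved from Monday to Friday because ..."
             "\nSummary: The meeting was moved to Friday."},
]
dataset = Dataset.from_list(examples).map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=512),
    remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="single-task-out",
                           per_device_train_batch_size=1,
                           num_train_epochs=1),
    train_dataset=dataset,
    # Causal LM objective (mlm=False): labels are the shifted input ids.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

In practice you would swap in your own task's dataset and probably a parameter-efficient method (e.g. LoRA) to keep memory manageable, but the single-task idea stays the same.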
Does it make sense?
Thoughts?