@Yacine_Mazari
This is my understanding of this case:
The goal is to make the model more robust across multiple tasks, i.e. to generalize better, not just on a single task but on as many as possible.
For this, FLAN does instruction tuning on a dataset that combines different types of instructions: translation, summarization, Q&A, sentiment analysis, etc. They built a dataset that groups instructions from all these (and more) tasks into task clusters.
At the end of the tuning, they expect the model to have become better at generalizing. For this reason, they want to test it on some tasks that were not seen during training.
Let's suppose we have these tasks: translation, summarization, Q&A, sentiment analysis.
Let's also suppose we build a dataset that includes instructions for the first 3 tasks in the list above.
After the fine-tuning, we want to see whether the model actually became better at generalizing. So we test it on ‘sentiment analysis’, which was not included in the fine-tuning. Hopefully we will see a good result.
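To make that concrete, here is a toy sketch in Python of the held-out split. The cluster names and example strings are just mine for illustration, not the actual FLAN data:

```python
# Toy FLAN-style setup: tune on some task clusters, hold one out for evaluation.
# Task names and examples are made up for illustration.
task_clusters = {
    "translation":        [{"instruction": "Translate to French: Good morning.", "output": "Bonjour."}],
    "summarization":      [{"instruction": "Summarize: The meeting was moved to Friday because ...", "output": "Meeting moved to Friday."}],
    "qa":                 [{"instruction": "Q: What is the capital of Italy?", "output": "Rome."}],
    "sentiment_analysis": [{"instruction": "Is this review positive or negative? 'Loved it!'", "output": "positive"}],
}

held_out = "sentiment_analysis"  # never seen during instruction tuning

# Training mix: every example from every cluster except the held-out one.
train_set = [ex for task, exs in task_clusters.items() if task != held_out for ex in exs]
# Evaluation: the held-out cluster, used zero-shot to measure generalization.
eval_set = task_clusters[held_out]

print(f"train examples: {len(train_set)}, held-out ({held_out}) examples: {len(eval_set)}")
```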
The paper mentions that this works well for rather large models, while it can actually decrease held-out performance in smaller models.
Now to your 2nd question:
For our practical use cases, where we (well, I) don’t have the compute to fine-tune a model bigger than 20B params and have to work with smaller models, say 7B params or less, my recommendation is: fine-tune on a single task.
In fact, as the paper says and as noted above, doing multi-task instruction tuning on smaller models can be counter-productive. So: let's fine-tune with a dataset corresponding to our single task.
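If it helps, here is a minimal single-task fine-tuning sketch with Hugging Face transformers. The model name, hyperparameters, and the one summarization example are placeholders, not recommendations:

```python
# Minimal single-task (summarization) fine-tuning sketch with Hugging Face transformers.
# Model name, hyperparameters, and the single example below are placeholders only.
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

model_name = "EleutherAI/pythia-1b"  # any causal LM at or under ~7B params
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# One task only: summarization prompts paired with reference summaries.
examples = [
    {"text": "Summarize: The meeting was moved from Monday to Friday because ..."
             "\nSummary: The meeting was moved to Friday."},
]
dataset = Dataset.from_list(examples).map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=512),
    remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="single-task-out",
                           per_device_train_batch_size=1,
                           num_train_epochs=1),
    train_dataset=dataset,
    # Causal LM objective (mlm=False): labels are the shifted input ids.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

In practice you would swap in your own task's dataset and probably a parameter-efficient method (e.g. LoRA) to keep memory manageable, but the single-task idea stays the same.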
Does it make sense?
Thoughts?