Hi,
I’m trying to get clarity on multi-task fine-tuning. The videos refer to very generic tasks like translation or summarization. But let’s say I’m building a poker app that answers basic poker questions, such as: Player 1 has these cards, Player 2 has these cards, these are the community cards, who wins the hand, etc. Will the multiple poker rules (Straight, Flush, Full House, etc.) be interpreted as separate tasks for multi-task fine-tuning, or will all of that fall under one task and hence be considered single-task fine-tuning?
My next question is about benchmarks: it seems like the ROUGE metric wouldn’t apply to a task like code generation, since the variable names will never match, so how do you evaluate for accuracy? We need to measure the objective of that for loop or while loop, correct?
For the first question: if you want to fine-tune a model to perform multiple distinct tasks, such as identifying different types of poker hands, answering rules questions, and so on, then each of those can be treated as a separate task for multi-task fine-tuning. However, if you only want the model to perform the single task of identifying the winning hand, then it would be single-task fine-tuning, even though many poker rules feed into that one task.
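To make the distinction concrete, here is a minimal sketch of what the training data might look like in each case (the prompt/completion format and the example wording are illustrative assumptions, not a required schema):

```python
# Illustrative (hypothetical) instruction-tuning examples.

# Multi-task fine-tuning: the dataset mixes several distinct tasks,
# each with its own instruction style.
multi_task_examples = [
    {"prompt": "Classify this hand: Ah Kh Qh Jh Th",
     "completion": "Royal Flush"},
    {"prompt": "Player 1: As Ad. Player 2: Kc Kd. Board: 2h 7s 9c Qd 3s. Who wins?",
     "completion": "Player 1 wins with a pair of Aces."},
    {"prompt": "Summarize the action: Player 3 raised pre-flop and everyone folded.",
     "completion": "Player 3 took the pot with a pre-flop raise."},
]

# Single-task fine-tuning: every example targets the same task
# (picking the winner), even though many poker rules are involved.
single_task_examples = [
    {"prompt": "Player 1: As Ad. Player 2: Kc Kd. Board: 2h 7s 9c Qd 3s. Who wins?",
     "completion": "Player 1"},
    {"prompt": "Player 1: 5h 6h. Player 2: Ac 2c. Board: 7h 8h 9h Kd Ks. Who wins?",
     "completion": "Player 1"},
]
```

In the multi-task dataset the model sees several instruction styles during fine-tuning, while the single-task dataset repeats one instruction pattern with varied inputs.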
Regarding your second question, you are correct that the ROUGE metric may not be suitable for evaluating code generation tasks. In such cases, other metrics like BLEU or METEOR can be used. However, it’s important to note that these metrics also measure surface text overlap, so they do not capture the correctness of the generated code. It may be necessary to evaluate the output manually to ensure the generated code is correct.
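For example, here is a rough sketch (assuming Hugging Face’s `evaluate` package, installed via `pip install evaluate rouge_score`) showing how ROUGE scores two functionally identical snippets that differ only in variable names:

```python
import evaluate

# Load the ROUGE metric from the Hugging Face evaluate library.
rouge = evaluate.load("rouge")

reference = "total = 0\nfor item in items:\n    total += item"
prediction = "s = 0\nfor x in values:\n    s += x"  # same behavior, different names

scores = rouge.compute(predictions=[prediction], references=[reference])
print(scores)  # ROUGE scores come out low despite identical program behavior
```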
Thanks for your response, Atharva. Quick follow-up question: if my goal is to build a Q&A-type app that, among other things, can pick winners, then I would implement it as multi-task fine-tuning, correct? For example, I could stream the transcript of a poker video and ask how many full houses it was able to identify, etc. Thanks in advance.
-Satish
I’m guessing your question is whether multi-task fine-tuning might be a viable approach for a Q&A app. Implementing your Q&A app with multi-task fine-tuning is a possible approach, and it can be useful for training the model to perform multiple tasks or identify different types of information. However, keep in mind that there may be other approaches depending on the specific requirements of your app.
Hi @Satish1
ROUGE metric
ROUGE is a metric used to evaluate the performance of text summarization models. It measures the n-gram overlap between the generated summary and a reference summary. That makes it a poor fit for tasks like code generation, where the variable names will rarely match even when the code is correct.
In this case, it is better to use an evaluation that measures the objective of the code, i.e., whether the for loop or while loop actually does what it is supposed to do. For example, you could count the number of correct outputs the generated code produces on a set of test cases.
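Here is a rough sketch of that idea, in the spirit of unit-test-based benchmarks such as HumanEval (the `solution` entry-point name and this simple harness are my own illustrative assumptions):

```python
def run_generated_code(code_str, test_cases):
    """Execute generated code and score it against (args, expected) pairs.

    WARNING: exec() on untrusted model output is unsafe; real evaluation
    harnesses sandbox this step.
    """
    namespace = {}
    exec(code_str, namespace)           # defines the candidate function
    fn = namespace["solution"]          # assumed entry-point name
    passed = sum(1 for args, expected in test_cases if fn(*args) == expected)
    return passed / len(test_cases)     # fraction of tests passed

generated = "def solution(n):\n    return sum(range(1, n + 1))"
tests = [((3,), 6), ((5,), 15), ((1,), 1)]
print(run_generated_code(generated, tests))  # 1.0 -> all tests pass
```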
In the case of the poker app, the multiple poker rules could be treated as inputs for multi-task fine-tuning. This would allow the model to learn the relationships between the different rules, so it could answer more complex poker questions, such as “Who will win the hand?” or “What is the best play?”
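If you go that route, one way to generate reliable ground-truth labels for those training questions is a deterministic, rules-based hand evaluator rather than another model. Below is a minimal sketch for categorizing a single five-card hand (the card encoding and coarse categories are my own simplification; it ignores kickers and tie-breaking):

```python
from collections import Counter

# Cards are (rank, suit) tuples, e.g. (14, 'h') for the Ace of hearts.
CATEGORIES = ["High Card", "Pair", "Two Pair", "Three of a Kind",
              "Straight", "Flush", "Full House", "Four of a Kind",
              "Straight Flush"]

def hand_category(cards):
    """Return the coarse category of a 5-card hand."""
    ranks = sorted(r for r, _ in cards)
    suits = [s for _, s in cards]
    counts = sorted(Counter(ranks).values(), reverse=True)

    is_flush = len(set(suits)) == 1
    is_straight = (ranks == list(range(ranks[0], ranks[0] + 5))
                   or ranks == [2, 3, 4, 5, 14])  # wheel: A-2-3-4-5

    if is_straight and is_flush: return CATEGORIES[8]
    if counts == [4, 1]:         return CATEGORIES[7]
    if counts == [3, 2]:         return CATEGORIES[6]
    if is_flush:                 return CATEGORIES[5]
    if is_straight:              return CATEGORIES[4]
    if counts == [3, 1, 1]:      return CATEGORIES[3]
    if counts == [2, 2, 1]:      return CATEGORIES[2]
    if counts == [2, 1, 1, 1]:   return CATEGORIES[1]
    return CATEGORIES[0]

# Example: generate a ground-truth label for a training pair.
print(hand_category([(14, 'h'), (12, 'h'), (9, 'h'), (6, 'h'), (2, 'h')]))  # Flush
```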