Evaluation methods for a specific task

I am fine-tuning LLMs on a specific task, but this task is not closely related to summarisation, translation, etc. To my understanding, the common benchmarks are evaluation tools for general-purpose LLMs such as LLaMA, Falcon-b, etc., which serve multiple tasks. Is there an evaluation technique for LLMs fine-tuned on a specific task, and how can I know whether my model is doing better on that task without judging the outputs only by myself?

I’d like to ask:

  1. What type of fine-tuning are you doing?

  2. What is your task? Is it classification? Something else?

Based on that I can probably help.


Hi! Thank you for your reply. My task is basically generating funny captions: I have a dataset of funny captions and I am fine-tuning LLMs to generate funny captions based on it, but I am only evaluating the outputs myself. It would be great to hear any suggestions from your side.

Wow! Interesting task.

“Funny” is very subjective. Unfortunately, I don’t have much to suggest beyond manual supervision.

I have worked extensively on summary generation and, while LLMs do a great job, there is often a key component missing from the summary that sometimes turns out to be the whole point of the document. I have refined my prompts by adding edge cases, but new ones keep appearing.

I find your task somewhat similar to mine as far as measuring performance goes. I have not yet found a way to measure the quality of the outputs I generate. In fact, when people read them, most say they are great, but then a keen reader spots a gap that turns out to be fundamental.

In the "funny’ world, I find probably the same challenge.