Evaluation methods for a specific task

I am fine-tuning LLMs on a specific task, but this task is not closely related to summarisation, translation, etc. To my understanding, the common benchmarks are evaluation tools for general-purpose LLMs such as LLaMA, Falcon-b, etc., which serve multiple tasks. Is there an evaluation technique for LLMs fine-tuned on a specific task, and how can I know whether my model is doing better on that task without judging the outputs only by myself?

I’d like to ask:

  1. What type of fine-tuning are you doing?

  2. What is your task? Is it classification? Something else?

Based on that I can probably help.


Hi! Thank you for your reply. My task is basically generating funny captions: I have a dataset of funny captions and I am fine-tuning LLMs to generate funny captions based on it, but I am only evaluating the outputs myself. It would be great to hear any suggestions from your side.

Wow! Interesting task.

“Funny” is very subjective. Unfortunately, I don’t have much to suggest beyond manual supervision.

I have worked extensively on summary generation and, while LLMs do a great job, there is often a key component missing from the summary that sometimes turns out to be the whole point of the document. I have refined my prompts by adding edge cases, but new ones keep appearing.

I find your task somewhat similar to mine as far as measuring performance goes. I have not yet found a way to measure the quality of the outputs I generate. In fact, when people read them, most say they are great, but then a keen reader spots a gap that turns out to be fundamental.

In the "funny’ world, I find probably the same challenge.