Advanced LLM evaluation techniques

I am trying to understand how LLMs are evaluated. Simple metrics like ROUGE or BLEU use n-gram overlap to compare how similar the output is to a human baseline. Because these metrics rely on word-to-word similarity, they fail in situations where the meaning is similar but different words are used. For example, "Mike really loves drinking tea." and "Mike adores sipping tea." are similar in meaning but will score poorly under ROUGE or BLEU. Are there more advanced evaluation metrics that solve this problem?
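For illustration, here is a minimal sketch of that failure mode using NLTK's `sentence_bleu` (the tokenisation and smoothing choices below are just assumptions for the example):

```python
# Two paraphrases score poorly under BLEU even though they mean
# nearly the same thing. Requires: pip install nltk
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = ["Mike", "really", "loves", "drinking", "tea", "."]
candidate = ["Mike", "adores", "sipping", "tea", "."]

# Smoothing avoids a hard zero when higher-order n-grams have no overlap.
smooth = SmoothingFunction().method1
score = sentence_bleu([reference], candidate, smoothing_function=smooth)
print(f"BLEU: {score:.3f}")  # very low, despite the sentences being paraphrases
```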


Welcome to the community, @myacoob!
BLEU and ROUGE are n-gram metrics which are definitely powerful, but, as you pointed out, they are not a silver bullet for everything.

Here you will find a nice paper on the evaluation of word embeddings. Feel free to take a look at it:

Additionally, it might be a good strategy to get familiar with standard data science evaluation metrics like Kullback–Leibler divergence, sometimes also called relative entropy. It is often used in NLP to compare how similar two distributions are; see also this thread: Measurement of diversity - #4 by Christian_Simonis. Good thing is: you will explore that in week 3 of the LLM course.
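To make that concrete, here is a minimal sketch of KL divergence over two toy word-frequency distributions (the use of scipy and the toy numbers are just assumptions for illustration):

```python
# KL divergence (relative entropy) between two discrete distributions,
# e.g. word frequencies from a reference corpus vs. model output.
# Requires: pip install numpy scipy
import numpy as np
from scipy.stats import entropy

# Toy distributions over the same vocabulary (each must sum to 1).
p = np.array([0.5, 0.3, 0.2])   # e.g. reference corpus
q = np.array([0.4, 0.4, 0.2])   # e.g. model output

kl_pq = entropy(p, q)  # D_KL(p || q); note that KL is not symmetric
print(f"KL(p || q) = {kl_pq:.4f}")  # 0 only if the distributions match exactly
```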

Hope that helps!

Best regards
Christian


@Christian_Simonis Thanks for pointing me in the right direction. Digging into evaluation techniques, I learnt about BERTScore and SAS (semantic answer similarity). I also read about how MMLU evaluates LLMs, and at the end I was blown away by the idea that LLMs can be used to evaluate other LLMs! (I do understand there are some limitations, but the idea is cool.)
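For anyone landing here later, here is a minimal sketch of BERTScore on the example sentences from my original post (the `bert-score` package and its default English model are assumptions; the first run downloads model weights):

```python
# BERTScore compares candidate and reference via contextual token
# embeddings, so paraphrases score high even without n-gram overlap.
# Requires: pip install bert-score
from bert_score import score

candidates = ["Mike adores sipping tea."]
references = ["Mike really loves drinking tea."]

P, R, F1 = score(candidates, references, lang="en")
print(f"BERTScore F1: {F1.item():.3f}")  # high, reflecting the shared meaning
```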
