I am trying to understand how LLMs are evaluated. Simple metrics like ROUGE or BLEU use n-gram-based logic to compare how similar the output is to a human baseline. These evaluation metrics depend on word-to-word similarity to produce comparison results. They fail, however, in situations where the meaning is similar but different sets of words are used. For example, “Mike really loves drinking tea.” and “Mike adores sipping tea” are similar in meaning but will score poorly under ROUGE or BLEU. Are there advanced evaluation metrics that solve this problem?
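To make the problem concrete, here is a minimal sketch using NLTK's `sentence_bleu` (assuming `nltk` is installed; the sentences are the toy example from above):

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "Mike really loves drinking tea.".split()
candidate = "Mike adores sipping tea".split()

# BLEU only counts n-gram overlap; the paraphrase shares almost no
# tokens with the reference, so the score is very low despite the
# nearly identical meaning.
smooth = SmoothingFunction().method1  # avoid zero scores on short sentences
print(sentence_bleu([reference], candidate, smoothing_function=smooth))
```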
Welcome to the community, @myacoob!
BLEU and ROUGE are n-gram metrics which are definitely powerful, but as you pointed out, they are not a silver bullet for everything.
Here you can find a nice paper on the evaluation of word embeddings. Feel free to take a look at it:
Additionally, it might also be a good strategy to get familiar with standard data science evaluation metrics like Kullback–Leibler divergence, sometimes also called relative entropy. It is often used in NLP to compare how similar two probability distributions are; see also this thread: Measurement of diversity - #4 by Christian_Simonis. The good thing is: you will explore that in week 3 of the LLM course.
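As a quick illustration, here is a minimal SciPy sketch (the two distributions are made-up toy numbers, just for demonstration):

```python
import numpy as np
from scipy.stats import entropy

# Two toy probability distributions over the same set of events
# (hypothetical numbers, purely for illustration)
p = np.array([0.6, 0.3, 0.1])
q = np.array([0.5, 0.3, 0.2])

# scipy.stats.entropy(p, q) computes the KL divergence D_KL(p || q);
# it is 0 only when the two distributions are identical.
print(entropy(p, q))
```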
Hope that helps!
Best regards
Christian
@Christian_Simonis Thanks for pointing me in the right direction. Delving into evaluation techniques, I learned about BERTScore and SAS (Semantic Answer Similarity). Along with that, I read about how MMLU evaluates LLMs. And at the end I was blown away when introduced to the idea that LLMs can be used to evaluate LLMs! (I do understand there are some limitations, but the idea is cool.)
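In case it helps others, here is a minimal sketch of BERTScore using the `bert-score` package (assuming it is installed; it downloads a model on first use):

```python
from bert_score import score

candidates = ["Mike adores sipping tea"]
references = ["Mike really loves drinking tea."]

# BERTScore compares contextual embeddings rather than raw n-grams,
# so paraphrases built from different words can still score highly.
P, R, F1 = score(candidates, references, lang="en")
print(F1.mean().item())
```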