Welcome to the community, @myacoob!
BLEU and ROUGE are n-gram metrics which are definitely powerful, but, as you pointed out, they are not a silver bullet for everything.
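To make the n-gram idea concrete, here is a minimal sketch of scoring a candidate sentence with BLEU via NLTK (the sentences are just toy examples):

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Toy example: one reference translation and one candidate
reference = [["the", "cat", "sat", "on", "the", "mat"]]  # list of reference sentences
candidate = ["the", "cat", "is", "on", "the", "mat"]

# Smoothing avoids a zero score when some higher-order n-gram has no match
score = sentence_bleu(reference, candidate,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {score:.4f}")
```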
Here you can find a nice paper on the evaluation of word embeddings. Feel free to take a look at it:
Additionally, it might be a good strategy to get familiar with standard data science evaluation metrics such as the Kullback–Leibler divergence, sometimes also called relative entropy. It is often used in NLP to quantify how much one probability distribution differs from another, see also this thread: Measurement of diversity - #4 by Christian_Simonis. The good thing is: you will explore that in week 3 of the LLM course.
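If you want to play with it right away, here is a minimal sketch of computing the KL divergence between two discrete distributions with SciPy (the distributions p and q are just made-up example values):

```python
import numpy as np
from scipy.stats import entropy

# Two example discrete probability distributions (made-up values)
p = np.array([0.5, 0.3, 0.2])  # "true" distribution P
q = np.array([0.4, 0.4, 0.2])  # approximating distribution Q

# scipy.stats.entropy(p, q) returns D_KL(P || Q) in nats
kl_pq = entropy(p, q)
print(f"D_KL(P || Q) = {kl_pq:.4f}")

# Note: KL divergence is not symmetric, so D_KL(P || Q) != D_KL(Q || P)
kl_qp = entropy(q, p)
print(f"D_KL(Q || P) = {kl_qp:.4f}")
```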
Hope that helps!
Best regards
Christian