Why not a semantic metric?

This is not a question about a lab, but about the concepts in the course. In the “Model evaluation” video, ROUGE and BLEU are described. They score responses based on common tokens or n-grams of various sizes between a model’s responses and baseline human responses. But as the video points out, these evaluation techniques are very surfacy and easy to trick, and don’t take semantics into account at all. For example, if the baseline response is “It’s cold today,” then “It’s cool today” and “It’s hot today” (and “It’s Wednesday today” and “It’s not today”) would all have the same ROUGE and BLEU scores, even though “It’s cool today” is much closer to the baseline. (Am I understanding that correctly?)
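
For concreteness, here is a toy unigram-overlap calculation (just in the spirit of ROUGE-1 / BLEU-1, not the official implementations) that shows what I mean:

```python
# Toy unigram-overlap scores for the example above (not the official ROUGE/BLEU code).
reference = "it's cold today".split()
candidates = [
    "it's cool today",
    "it's hot today",
    "it's wednesday today",
    "it's not today",
]

ref_tokens = set(reference)
for cand in candidates:
    tokens = cand.split()
    overlap = sum(1 for t in tokens if t in ref_tokens)
    precision = overlap / len(tokens)    # BLEU-1-style precision
    recall = overlap / len(reference)    # ROUGE-1-style recall
    print(f"{cand!r}: precision={precision:.2f}, recall={recall:.2f}")

# All four candidates come out at 0.67 / 0.67, even though only "cool"
# is close in meaning to "cold".
```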

So my question is, why don’t people use a metric that takes the meanings of words into account? I mean, we already have vector embeddings for these tokens, which can tell us automatically that “cool” is closer in meaning to “cold” than “hot” is. Wouldn’t that give us a much more helpful metric?
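
Something like this is what I have in mind (a sketch assuming the sentence-transformers package; the model name is just one commonly used choice, not something from the course):

```python
# Sketch: embedding cosine similarity *does* separate the candidates.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

reference = "It's cold today"
candidates = ["It's cool today", "It's hot today", "It's Wednesday today", "It's not today"]

ref_emb = model.encode(reference, convert_to_tensor=True)
cand_embs = model.encode(candidates, convert_to_tensor=True)

similarities = util.cos_sim(ref_emb, cand_embs)[0]
for cand, sim in zip(candidates, similarities):
    print(f"{cand!r}: cosine similarity = {sim.item():.3f}")

# "It's cool today" should come out noticeably closer to the reference than the others.
```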

I’m not knowledgeable in the field, just taking this course and wondering. Maybe there are semantic metrics out there, but for some reason they’re not as useful or practical as one might expect, so they don’t come up in a discussion of model evaluation?

Metrics based on embeddings exist and are used in recent NLP research. They just weren’t in the “classic” toolbox yet when BLEU/ROUGE became standard. But they also bring some challenges with them:

Bias toward pretrained models:

If you use BERTScore, you’re essentially relying on BERT’s embedding space. That means your evaluation inherits the biases and limitations of BERT.
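
For reference, a minimal sketch of calling BERTScore, assuming the bert-score Python package; which underlying model gets loaded (and therefore the exact numbers) depends on the package’s defaults:

```python
# Minimal BERTScore sketch, assuming the bert-score package (pip install bert-score).
from bert_score import score

candidates = ["It's cool today", "It's hot today"]
references = ["It's cold today", "It's cold today"]

# lang="en" lets the package pick its default English model; the scores you get
# are tied to that model's embedding space.
P, R, F1 = score(candidates, references, lang="en")
for cand, f1 in zip(candidates, F1):
    print(f"{cand!r}: BERTScore F1 = {f1.item():.3f}")
```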

Task-specific reliability:

In machine translation, summarization, or dialogue, different nuances matter, and a general embedding similarity might not match human judgment perfectly.

Reproducibility and standards:

BLEU/ROUGE are cheap, deterministic, and universally understood. Semantic metrics are more complex and can vary depending on which pretrained model you use.
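
For comparison, this is roughly how cheap and deterministic the n-gram side is (a sketch assuming the sacrebleu package):

```python
# Sketch of a standard BLEU computation, assuming the sacrebleu package.
import sacrebleu

hypotheses = ["It's cool today", "It's hot today"]
references = [["It's cold today", "It's cold today"]]  # one reference per hypothesis

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(bleu.score)  # no model download, no randomness: same inputs, same score
```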


Thanks, this is helpful.

It just seems strange that the lesson would

  1. Introduce a motivating example in which comparing word-by-word (or token-by-token?) fails to capture critical semantic differences;
  2. Describe ROUGE & BLEU, which are more sophisticated word-by-word comparison algorithms but still don’t take semantic similarities and differences into account; and
  3. Fail to mention that there are actually metrics that attempt to address the key problem described in the motivating example.

:man_shrugging:
