Why not a semantic metric?

This is not a question about a lab, but about the concepts in the course. In the “Model evaluation” video, ROUGE and BLEU are described. They score responses based on common tokens or n-grams of various sizes between a model’s responses and baseline human responses. But as the video points out, these evaluation techniques are very surfacy and easy to trick, and don’t take semantics into account at all. For example, if the baseline response is “It’s cold today,” then “It’s cool today” and “It’s hot today” (and “It’s Wednesday today” and “It’s not today”) would all have the same ROUGE and BLEU scores, even though “It’s cool today” is much closer to the baseline. (Am I understanding that correctly?)
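
For concreteness, here is a toy unigram-overlap calculation (just in the spirit of ROUGE-1 / BLEU-1, not the official implementations) that shows what I mean:

```python
# Toy unigram-overlap scores for the example above (not the official ROUGE/BLEU code).
reference = "it's cold today".split()
candidates = [
    "it's cool today",
    "it's hot today",
    "it's wednesday today",
    "it's not today",
]

ref_tokens = set(reference)
for cand in candidates:
    tokens = cand.split()
    overlap = sum(1 for t in tokens if t in ref_tokens)
    precision = overlap / len(tokens)    # BLEU-1-style precision
    recall = overlap / len(reference)    # ROUGE-1-style recall
    print(f"{cand!r}: precision={precision:.2f}, recall={recall:.2f}")

# All four candidates come out at 0.67 / 0.67, even though only "cool"
# is close in meaning to "cold".
```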

So my question is, why don’t people use a metric that takes the meanings of words into account? I mean, we already have vector embeddings for these tokens, which can tell us automatically that “cool” is closer in meaning to “cold” than “hot” is. Wouldn’t that give us a much more helpful metric?
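
Something like this is what I have in mind (a sketch assuming the sentence-transformers package; the model name is just one commonly used choice, not something from the course):

```python
# Sketch: embedding cosine similarity *does* separate the candidates.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

reference = "It's cold today"
candidates = ["It's cool today", "It's hot today", "It's Wednesday today", "It's not today"]

ref_emb = model.encode(reference, convert_to_tensor=True)
cand_embs = model.encode(candidates, convert_to_tensor=True)

similarities = util.cos_sim(ref_emb, cand_embs)[0]
for cand, sim in zip(candidates, similarities):
    print(f"{cand!r}: cosine similarity = {sim.item():.3f}")

# "It's cool today" should come out noticeably closer to the reference than the others.
```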

I’m not knowledgeable in the field, just taking this course and wondering. Maybe there are semantic metrics out there, but for some reason they’re not as useful or practical as one might expect, so they don’t come up in a discussion of model evaluation?

Metrics based on embeddings exist and are used in recent NLP research. They just weren’t in the “classic” toolbox yet when BLEU/ROUGE became standard. But they also bring some challenges with them:

Bias toward pretrained models:

If you use BERTScore, you’re essentially relying on BERT’s embedding space. That means your evaluation inherits the biases and limitations of BERT.
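
For reference, a minimal sketch of calling BERTScore, assuming the bert-score Python package; which underlying model gets loaded (and therefore the exact numbers) depends on the package’s defaults:

```python
# Minimal BERTScore sketch, assuming the bert-score package (pip install bert-score).
from bert_score import score

candidates = ["It's cool today", "It's hot today"]
references = ["It's cold today", "It's cold today"]

# lang="en" lets the package pick its default English model; the scores you get
# are tied to that model's embedding space.
P, R, F1 = score(candidates, references, lang="en")
for cand, f1 in zip(candidates, F1):
    print(f"{cand!r}: BERTScore F1 = {f1.item():.3f}")
```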

Task-specific reliability:

In machine translation, summarization, or dialogue, different nuances matter, and a general embedding similarity might not match human judgment perfectly.

Reproducibility and standards:

BLEU/ROUGE are cheap, deterministic, and universally understood. Semantic metrics are more complex and can vary depending on which pretrained model you use.
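
For comparison, this is roughly how cheap and deterministic the n-gram side is (a sketch assuming the sacrebleu package):

```python
# Sketch of a standard BLEU computation, assuming the sacrebleu package.
import sacrebleu

hypotheses = ["It's cool today", "It's hot today"]
references = [["It's cold today", "It's cold today"]]  # one reference per hypothesis

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(bleu.score)  # no model download, no randomness: same inputs, same score
```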


Thanks, this is helpful.

It just seems strange that the lesson would

  1. Introduce a motivating example in which comparing word-by-word (or token-by-token?) fails to capture critical semantic differences;
  2. Describe ROUGE & BLEU, which are more sophisticated word-by-word comparison algorithms but still don’t take semantic similarities and differences into account; and
  3. Fail to mention that there are actually metrics that attempt to address the key problem described in the motivating example.

:man_shrugging:
