This is not a question about a lab, but about the concepts in the course. In the “Model evaluation” video, ROUGE and BLEU are described. They score a model’s responses by counting the tokens or n-grams of various sizes they share with baseline human responses. But as the video points out, these evaluation techniques are very surface-level and easy to trick, and don’t take semantics into account at all. For example, if the baseline response is “It’s cold today,” then “It’s cool today” and “It’s hot today” (and “It’s Wednesday today” and “It’s not today”) would all get the same ROUGE and BLEU scores, even though “It’s cool today” is much closer in meaning to the baseline. (Am I understanding that correctly?)
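To check that I follow the mechanics, here’s a toy unigram-overlap score I wrote myself (an F1 in the style of ROUGE-1, not the official rouge_score package), showing that all four candidates tie:

```python
from collections import Counter

def rouge1_f1(reference: str, candidate: str) -> float:
    """Toy ROUGE-1-style F1: unigram overlap between two strings."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    overlap = sum((ref & cand).values())  # shared unigrams, counted with multiplicity
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

reference = "It's cold today"
for candidate in ["It's cool today", "It's hot today",
                  "It's Wednesday today", "It's not today"]:
    print(f"{candidate}: {rouge1_f1(reference, candidate):.3f}")
# All four print 0.667: "it's" and "today" match, the middle word never does.
```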
So my question is, why don’t people use a metric that takes the meanings of words into account? I mean, we already have vector embeddings for these tokens, which can tell us automatically that “cool” is closer in meaning to “cold” than “hot” is. Wouldn’t that give us a much more helpful metric?
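Here’s a rough sketch of the kind of comparison I have in mind, using the sentence-transformers library (the specific model name is just one common choice, not something from the course):

```python
from sentence_transformers import SentenceTransformer, util

# Any pretrained sentence-embedding model would do; this one is a common pick.
model = SentenceTransformer("all-MiniLM-L6-v2")

reference = "It's cold today"
candidates = ["It's cool today", "It's hot today",
              "It's Wednesday today", "It's not today"]

ref_emb = model.encode(reference, convert_to_tensor=True)
for candidate in candidates:
    cand_emb = model.encode(candidate, convert_to_tensor=True)
    score = util.cos_sim(ref_emb, cand_emb).item()  # cosine similarity in [-1, 1]
    print(f"{candidate}: {score:.3f}")
# Unlike the ROUGE/BLEU case above, these scores should differ,
# with "It's cool today" landing closest to the reference.
```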
I’m not knowledgeable in the field; I’m just taking this course and wondering. Maybe there are semantic metrics out there, but for some reason they’re not as useful or practical as one might expect, so they don’t come up in a discussion of model evaluation?