How do we solve the metric problem with "not" and synonyms?

In this week’s lecture, the instructor mentioned that “It is cold” and “It is not cold” differ by only one word but have very different meanings. I could not find any solution to this problem in the lecture.
Another example from the lecture is “I like drinking coffee” versus “I abhor shipping coffee”: neither ROUGE nor BLEU, as discussed in the lecture, seems to evaluate their similarity properly.
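
To make the problem concrete, here is a minimal sketch of a unigram-overlap F1 score in the spirit of ROUGE-1. It is not the official ROUGE implementation: the whitespace tokenization, the lowercasing, and the name `rouge1_f1` are assumptions made for this illustration.

```python
from collections import Counter


def rouge1_f1(reference: str, candidate: str) -> float:
    """Unigram-overlap F1 between two sentences (simplified ROUGE-1 style)."""
    # Naive tokenization: lowercase and split on whitespace (an assumption).
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())

    # Clipped overlap: each word counts at most as often as it appears in both.
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0

    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


print(round(rouge1_f1("It is cold", "It is not cold"), 3))          # 0.857
print(rouge1_f1("I like drinking coffee", "I abhor shipping coffee"))  # 0.5
```

The negated sentence scores about 0.857 even though its meaning is the opposite, because word overlap alone cannot see what "not" does to the sentence.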

Is there some remedy for these evaluation issues, or do we simply accept them in practice (because automatic evaluation is difficult)?

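Maybe you could use the LLM itself as the evaluator: write a separate prompt that asks it to compare the candidate outputs, evaluate the result, and propose a solution. Or have it give a simple score, such as 1 if the task was completed correctly and 0 if not. Either way, this would cost much more than a metric like ROUGE or BLEU, since every evaluation needs its own model call.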
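
A rough sketch of the 1-or-0 scoring idea might look like the following. Everything here is hypothetical: the prompt wording, the `judge_meaning` function, and the `ask_llm` callable, which stands in for whatever model API you actually use. The stub at the bottom only exists so the sketch runs without a model.

```python
from typing import Callable

# Hypothetical prompt template; the wording is an assumption, not a tested prompt.
JUDGE_PROMPT = (
    "Reference: {reference}\n"
    "Candidate: {candidate}\n"
    "Does the candidate express the same meaning as the reference? "
    "Answer with a single character: 1 for yes, 0 for no."
)


def judge_meaning(reference: str, candidate: str,
                  ask_llm: Callable[[str], str]) -> int:
    """Ask a model for a binary same-meaning verdict and parse the reply."""
    prompt = JUDGE_PROMPT.format(reference=reference, candidate=candidate)
    reply = ask_llm(prompt).strip()
    # Fail loudly instead of guessing when the reply is not a clear 0 or 1.
    if reply[:1] not in ("0", "1"):
        raise ValueError(f"unexpected judge reply: {reply!r}")
    return int(reply[0])


def stub_llm(prompt: str) -> str:
    """Placeholder for a real model call; always answers 0."""
    return "0"


print(judge_meaning("It is cold", "It is not cold", stub_llm))  # 0
```

Passing the model in as a plain callable keeps the parsing logic testable on its own, and it makes the cost visible: one call to `ask_llm` per pair being evaluated.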