ROUGE and BLEU metrics


According to the video on model evaluation, at the very end, it's said that we can use the ROUGE score to evaluate summarization models and BLEU for translation tasks. However, for translation tasks, we will never have the same tokens in the completion as in the prompt, by definition. So how can we evaluate the model's performance using n-grams if the n-grams will be different in each language?

Thank you!

As far as I understand, BLEU will take the output and compare it with a human reference (or a reference defined by the architect). So it doesn’t matter that the input and output n-grams are different.

We aren't comparing the tokens in the prompt and the completion. We are comparing the generated completion with the reference (human-written) completion. Hence, sentences in French are compared only with sentences in French, not English.
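To make this concrete, here is a minimal pure-Python sketch of the clipped n-gram precision at the heart of BLEU. It compares a generated French sentence against a French reference only; the English prompt never enters the computation. (The real BLEU metric additionally combines several n-gram orders geometrically and applies a brevity penalty; this sketch and the example sentences are just for illustration.)

```python
from collections import Counter

def ngram_precision(candidate, reference, n):
    """Clipped n-gram precision: the fraction of the candidate's
    n-grams that also appear in the reference, with counts clipped
    so a repeated n-gram can't be credited more times than it
    occurs in the reference."""
    cand_tokens = candidate.split()
    ref_tokens = reference.split()
    cand_ngrams = Counter(tuple(cand_tokens[i:i + n])
                          for i in range(len(cand_tokens) - n + 1))
    ref_ngrams = Counter(tuple(ref_tokens[i:i + n])
                         for i in range(len(ref_tokens) - n + 1))
    overlap = sum(min(count, ref_ngrams[g]) for g, count in cand_ngrams.items())
    total = sum(cand_ngrams.values())
    return overlap / total if total else 0.0

# Both sentences are in French; we never compare against the English prompt.
reference = "le chat est sur le tapis"
candidate = "le chat est sur tapis"
print(ngram_precision(candidate, reference, 1))  # unigram precision: 1.0
print(ngram_precision(candidate, reference, 2))  # bigram precision: 0.75
```

Every unigram in the candidate appears in the reference, so unigram precision is perfect; the missing "le" breaks one bigram, so bigram precision drops, which is exactly the kind of fluency signal higher-order n-grams capture.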

Oh, I see. I was confused.