ROUGE and BLEU metrics


According to the video on model evaluation, at the very end, it's said that we can use the ROUGE score to evaluate summarization models and BLEU for translation tasks. However, for translation tasks, we will never have the same tokens in the completion as in the prompt, by definition. So how can we evaluate the model's performance using n-grams if the n-grams will be different in each language?

Thank you!

As far as I understand, BLEU will take the output and compare it with a human reference (or a reference defined by the architect). So it doesn’t matter that the input and output n-grams are different.

We aren't comparing the tokens in the prompt and the completion. We are comparing the generated completion with the reference (human-written) completion. Hence, sentences in French are compared only with sentences in French, not English.
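To make this concrete, here is a minimal pure-Python sketch of the clipped n-gram precision at the heart of BLEU. It compares a generated French sentence against a French reference only; the English prompt never enters the computation. (The real BLEU metric additionally combines several n-gram orders geometrically and applies a brevity penalty; this sketch and the example sentences are just for illustration.)

```python
from collections import Counter

def ngram_precision(candidate, reference, n):
    """Clipped n-gram precision: the fraction of the candidate's
    n-grams that also appear in the reference, with counts clipped
    so a repeated n-gram can't be credited more times than it
    occurs in the reference."""
    cand_tokens = candidate.split()
    ref_tokens = reference.split()
    cand_ngrams = Counter(tuple(cand_tokens[i:i + n])
                          for i in range(len(cand_tokens) - n + 1))
    ref_ngrams = Counter(tuple(ref_tokens[i:i + n])
                         for i in range(len(ref_tokens) - n + 1))
    overlap = sum(min(count, ref_ngrams[g]) for g, count in cand_ngrams.items())
    total = sum(cand_ngrams.values())
    return overlap / total if total else 0.0

# Both sentences are in French; we never compare against the English prompt.
reference = "le chat est sur le tapis"
candidate = "le chat est sur tapis"
print(ngram_precision(candidate, reference, 1))  # unigram precision: 1.0
print(ngram_precision(candidate, reference, 2))  # bigram precision: 0.75
```

Every unigram in the candidate appears in the reference, so unigram precision is perfect; the missing "le" breaks one bigram, so bigram precision drops, which is exactly the kind of fluency signal higher-order n-grams capture.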

Oh, I see. I was confused.