As the title suggests, what’s a good ROUGE/BLEU score to aim for? Have there been any studies correlating these scores with how humans perceive the output?
ROUGE and BLEU are commonly used metrics for evaluating the quality of text generation models. There is no specific score that can be considered “good” universally, because the values depend on the task, the dataset, and how the score is computed. Higher scores generally indicate better performance, so they are most useful for comparing models on the same test set.
As for the correlation between these scores and human perception, studies have shown a positive correlation between higher ROUGE and BLEU scores and human judgments of quality. However, it’s important to note that these metrics only measure certain aspects of text quality, namely word and n-gram overlap with reference texts (expressed as precision and recall), and may not capture more nuanced aspects of language use that humans perceive as important, such as meaning or fluency.
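To make the overlap idea concrete, here is a minimal sketch of unigram precision and recall, roughly what ROUGE-1 measures. It assumes simple lowercase whitespace tokenization, and the function name `unigram_overlap` and the example sentences are made up for illustration; real libraries add stemming, multiple references, and other details.

```python
from collections import Counter

def unigram_overlap(candidate: str, reference: str) -> dict:
    """Unigram precision, recall, and F1 (hypothetical helper, whitespace tokens)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count of each shared token,
    # so a repeated word is only credited as often as the reference has it.
    overlap = sum((cand & ref).values())
    precision = overlap / max(sum(cand.values()), 1)  # share of candidate words found in reference
    recall = overlap / max(sum(ref.values()), 1)      # share of reference words found in candidate
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}

reference = "the cat sat on the mat"
print(unigram_overlap("the cat sat on the mat", reference))      # exact copy of the reference
print(unigram_overlap("a feline rested on the rug", reference))  # paraphrase with similar meaning
```

The paraphrase shares only "on" and "the" with the reference, so it scores low even though a human reader might accept it. That is the kind of gap between overlap metrics and human perception described above.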
Thank you for posting. Happy Learning!
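To add to @Atharva_Divekar’s great and clear answer and as a means of documentation for future learners: Both scores go from 0 to 1, where 0 means no overlap with the reference and 1 means a perfect match. Some tools report the same value scaled to 0 to 100, so check which scale you are looking at before comparing numbers.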
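As a rough illustration of the 0 to 1 range, here is a simplified BLEU-style score. This is a sketch under stated assumptions, not a reference implementation: it uses a single reference, whitespace tokenization, n-grams up to `max_n = 2` (standard BLEU uses up to 4), and no smoothing. The names `ngrams` and `simple_bleu` are hypothetical.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def simple_bleu(candidate: str, reference: str, max_n: int = 2) -> float:
    """Simplified BLEU: geometric mean of clipped n-gram precisions times a brevity penalty."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_ngrams = ngrams(cand, n)
        ref_ngrams = ngrams(ref, n)
        total = sum(cand_ngrams.values())
        # Clipped matches: each n-gram counts at most as often as it appears in the reference.
        clipped = sum((cand_ngrams & ref_ngrams).values())
        if total == 0 or clipped == 0:
            return 0.0  # assumption: no smoothing, so any zero precision gives a score of 0
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty: candidates shorter than the reference are penalized.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(sum(log_precisions) / max_n)

reference = "the cat sat on the mat"
print(simple_bleu("the cat sat on the mat", reference))  # identical to the reference
print(simple_bleu("dogs bark loudly", reference))        # no shared words
print(simple_bleu("the cat sat", reference))             # correct but too short
```

The identical candidate reaches the top of the range and the unrelated one the bottom. The short candidate matches every n-gram it contains, yet the brevity penalty pulls it well below 1.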
Another item to remember is:
BLEU is primarily used to grade the quality of text translations.
ROUGE is primarily used to grade the quality of text summaries.
Other uses may apply, but these are the primary uses.
Thanks!
Juan