ROUGE-L Calculation in the lecture : "Model Evaluation" of Week-2

charan_chinni · August 21, 2023, 4:56am

I have a doubt regarding the calculation of the ROUGE-L metric in the attached slide. As I understand it, ROUGE-L precision is the length of the longest common subsequence (LCS) of the human-generated and machine-generated summaries divided by the number of words in the machine-generated summary, and ROUGE-L recall is the same LCS length divided by the number of words in the human-generated summary. Here the LCS is “It is cold outside” (4 words), so the precision is 4/5 = 0.8, where the denominator 5 is the number of words in the machine-generated summary, and the recall is 4/4 = 1, where the denominator 4 is the number of words in the human-generated summary. Finally, the F1 score is the harmonic mean of precision and recall: 2 × 0.8 × 1 / (0.8 + 1) ≈ 0.889. Therefore:
Precision : 0.8
Recall : 1
F1-Score : 0.889
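
The arithmetic above can be checked with a minimal Python sketch. The two sentences are assumptions reconstructed from the word counts in this post (a 4-word human reference and a 5-word machine output whose LCS is “It is cold outside”), and the function names `lcs_length` and `rouge_l` are illustrative, not taken from any library.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    # dp[i][j] holds the LCS length of a[:i] and b[:j].
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            if x == y:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]


def rouge_l(reference, candidate):
    """Return (precision, recall, f1) using whitespace tokens, lowercased."""
    ref = reference.lower().split()
    cand = candidate.lower().split()
    if not ref or not cand:
        return 0.0, 0.0, 0.0
    lcs = lcs_length(ref, cand)
    precision = lcs / len(cand)  # denominator: machine-generated length
    recall = lcs / len(ref)      # denominator: human reference length
    if lcs == 0:
        return 0.0, 0.0, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


reference = "It is cold outside"       # assumed human reference (4 words)
candidate = "It is very cold outside"  # assumed machine output (5 words)
p, r, f = rouge_l(reference, candidate)
print(f"Precision: {p:.3f}  Recall: {r:.3f}  F1-Score: {f:.3f}")
# Precision: 0.800  Recall: 1.000  F1-Score: 0.889
```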

Please correct me if I’m wrong!!

charan_chinni · August 22, 2023, 1:02am

Hello @rmwkwok,
According to the definition of the longest common subsequence, we need to look for the longest ordered set of tokens that appears in both sequences (not necessarily consecutively). The original paper, "ROUGE: A Package for Automatic Evaluation of Summaries", defines it the same way. Please check the attached image and the original paper (Microsoft Word - WAS2004-ROUGE-Package-Final-One-Column.doc) for the ROUGE metric calculation. Hence it is correct to consider “It is cold outside” as the longest common subsequence.
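
To make the "ordered but not necessarily consecutive" reading concrete, here is a rough Python sketch that recovers one LCS together with the positions of its tokens in the second sequence. The helper name `lcs_with_indices` and the 5-word machine sentence are assumptions for illustration; only the LCS “It is cold outside” comes from this thread.

```python
def lcs_with_indices(x, y):
    """Return (tokens, positions_in_y) for one longest common subsequence.

    Positions are 1-based.
    """
    # dp[i][j] holds the LCS length of x[:i] and y[:j].
    dp = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i in range(1, len(x) + 1):
        for j in range(1, len(y) + 1):
            if x[i - 1] == y[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])

    # Walk back from the bottom-right corner to collect matched tokens.
    tokens, positions = [], []
    i, j = len(x), len(y)
    while i > 0 and j > 0:
        if x[i - 1] == y[j - 1]:
            tokens.append(x[i - 1])
            positions.append(j)
            i -= 1
            j -= 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return tokens[::-1], positions[::-1]


human = "it is cold outside".split()
machine = "it is very cold outside".split()  # assumed machine output
tokens, positions = lcs_with_indices(human, machine)
print(tokens)     # ['it', 'is', 'cold', 'outside']
print(positions)  # [1, 2, 4, 5]
```

The positions 1, 2, 4, 5 skip index 3 ("very"), so the matched tokens keep their order without being adjacent in the machine-generated sentence.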

Please correct me if I’m wrong!!

Charan

rmwkwok · August 22, 2023, 1:40am

Hello Charan,

I take back my previous post. I think you are right, unless they are discussing their own variant of the ROUGE score. I had no luck finding anyone who uses that definition in a quick search.

Let me tag @chris.favila and see if he has any input on this.

Thanks for pointing this out and sharing the screenshot.

Cheers,
Raymond

joyjoycew · November 5, 2023, 5:00pm

Hi @charan_chinni, I have the same observation and thoughts as you. In my opinion, the slide shows a wrong calculation of the ROUGE-L score. My own calculation gives a precision of 0.8, a recall of 1, and an F1 score of 0.889, so I voted for your post.

@chris.favila Hey Chris, could you help validate our thoughts and help correct the slide if it is indeed an issue? Thanks!

Babji_Manohar_Erle · December 15, 2023, 11:20am

Wait, the paper DOES say “strict increasing sequence”. How did you take that to mean “ordered but not consecutive”?
An index sequence 2, 3, 4, 5 is strictly increasing, and
an index sequence 2, 4, 5, 8 is not “strictly” increasing. It is randomly increasing.
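
For reference, under the usual mathematical meaning of "strictly increasing" (each index is larger than the one before it, with no requirement that indices be adjacent), both index sequences above qualify. A small Python check, using an illustrative helper name that is not from the paper or any library:

```python
def is_strictly_increasing(indices):
    """True if every element is greater than the one before it."""
    return all(a < b for a, b in zip(indices, indices[1:]))


print(is_strictly_increasing([2, 3, 4, 5]))  # True: consecutive
print(is_strictly_increasing([2, 4, 5, 8]))  # True: gaps are allowed
print(is_strictly_increasing([2, 4, 4, 8]))  # False: 4 is repeated
```

On that reading, "strictly" rules out repeated or decreasing indices, not gaps between them.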
