ROUGE-L calculation in the lecture "Model Evaluation" of Week 2

I have a doubt about the calculation of the ROUGE-L metric in the attached slide. As I understand it, ROUGE-L precision is the length of the longest common subsequence (LCS) of the human-generated and machine-generated summaries, divided by the number of words in the machine-generated summary. Similarly, ROUGE-L recall is the same LCS length divided by the number of words in the human-generated summary. Here the longest common subsequence is “It is cold outside”, which has 4 words. The precision is therefore 4/5 = 0.8, where the denominator 5 is the number of words in the machine-generated summary, and the recall is 4/4 = 1, where the denominator 4 is the number of words in the human-generated summary. Finally, the F1 score is the harmonic mean of precision and recall: 2 × 0.8 × 1 / (0.8 + 1) ≈ 0.889. Therefore:
Precision : 0.8
Recall : 1
F1-Score : 0.889
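
For anyone who wants to check these numbers, here is a minimal sketch of the calculation as I understand it. The function names `lcs_length` and `rouge_l` are my own, and the 5-word machine summary "It is very cold outside" is an assumption on my part (the post only fixes its length and the LCS); whitespace tokenization and lowercasing are also simplifying assumptions, not necessarily what the lecture or any ROUGE package does.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    # dp[i][j] holds the LCS length of a[:i] and b[:j].
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            if x == y:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]


def rouge_l(reference, candidate):
    """Return (precision, recall, f1) based on the word-level LCS."""
    ref = reference.lower().split()
    cand = candidate.lower().split()
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        # No overlap (or an empty input): avoid dividing by zero.
        return 0.0, 0.0, 0.0
    precision = lcs / len(cand)  # LCS / words in machine summary
    recall = lcs / len(ref)      # LCS / words in human summary
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


reference = "It is cold outside"       # human summary (4 words)
candidate = "It is very cold outside"  # assumed machine summary (5 words)
p, r, f1 = rouge_l(reference, candidate)
print(f"Precision={p:.3f} Recall={r:.3f} F1={f1:.3f}")
# Precision=0.800 Recall=1.000 F1=0.889
```

The F1 here is the plain harmonic mean used in the post above; a library implementation may tokenize or weight precision and recall differently.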

Please correct me if I’m wrong!!

Hello @rmwkwok,
According to the definition of the longest common subsequence, we need to find the longest ordered sequence of tokens that appears in both sequences, not necessarily consecutively. The original paper, "ROUGE: A Package for Automatic Evaluation of Summaries", defines it the same way. Please check the attached image and the original paper link for the ROUGE metric calculation (Microsoft Word - WAS2004-ROUGE-Package-Final-One-Column.doc). Hence it is correct to take “It is cold outside” as the longest common subsequence.
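
To make the definition concrete, here is a small sketch that finds the positions at which a sequence Z matches inside X, in order. The helper name `subsequence_indices` and the 5-word sentence "It is very cold outside" are my own assumptions for illustration. It relies on the usual mathematical meaning of a strictly increasing index sequence, i.e. each index is larger than the one before it (i1 < i2 < ... < ik), which does not require the indices to be adjacent.

```python
def subsequence_indices(z, x):
    """Return indices of x that match z in order, or None if z is not a subsequence of x."""
    indices = []
    start = 0
    for token in z:
        try:
            # Earliest match at or after `start`; greedy matching is enough
            # to decide whether z is a subsequence of x.
            pos = x.index(token, start)
        except ValueError:
            return None
        indices.append(pos)
        start = pos + 1
    return indices


x = "It is very cold outside".split()  # assumed machine summary
z = "It is cold outside".split()       # candidate common subsequence

idx = subsequence_indices(z, x)
strictly_increasing = all(i < j for i, j in zip(idx, idx[1:]))
consecutive = all(j == i + 1 for i, j in zip(idx, idx[1:]))

print(idx)                  # [0, 1, 3, 4]
print(strictly_increasing)  # True
print(consecutive)          # False
```

So the matched indices skip position 2 ("very") and are still strictly increasing, which is why the four words count as one common subsequence under this reading.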

Please correct me if I’m wrong!!


Hello Charan,

I take back my previous post. I think you are right, unless they are discussing their own variant of the ROUGE score. A quick search did not turn up anyone else using that definition.

Let me tag @chris.favila and see if he has any input on this.

Thanks for pointing this out and sharing the screenshot.


Hi @charan_chinni, I have the same observation and thoughts as you. In my opinion, the slide shows an incorrect calculation of the ROUGE-L score. My own calculation gives a precision of 0.8, a recall of 1, and an F1 score of 0.889, so I voted for your post.

@chris.favila Hey Chris, could you help validate our thoughts and help correct the slide if it is indeed an issue? Thanks!

Wait, the paper DOES say “strict increasing sequence”. How did you take that to mean “ordered but not consecutive”?
As I read it, an index sequence such as 2, 3, 4, 5 is strictly increasing, whereas
an index sequence such as 2, 4, 5, 8 is not “strictly” increasing; it increases by irregular steps.