In the benchmarks, when we see ROUGE scores (ROUGE-1, ROUGE-2, ROUGE-L), what is the actual underlying score? Recall or Precision or F1 or some other transformation of some combination of these values?

This is an article that explains each of them and their respective formulas in terms of precision, recall, and F1. To go straight to the punchline:

The mean of the per-example F1 scores gives us the overall ROUGE-1 score for the dataset (and similarly for ROUGE-2…).

@gent.spah I know this question has been answered, but I tried looking for other posts that get at this inquiry, but figured rather than just starting a new one it might be good to add on here.

So… I was wondering if you had any ‘intuition’:

I mean in traditional ML we calculate precision and recall as follows:
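For reference, here is a minimal sketch of the classification-style definitions I mean; the counts in the example are made up:

```python
# Standard classification definitions:
#   precision = TP / (TP + FP)
#   recall    = TP / (TP + FN)

def precision(tp: int, fp: int) -> float:
    # Of everything the model flagged as positive, how much was actually correct?
    return tp / (tp + fp)

def recall(tp: int, fn: int) -> float:
    # Of all actual positives, how much did the model find?
    return tp / (tp + fn)

# Hypothetical counts: 8 true positives, 2 false positives, 4 false negatives
print(precision(8, 2))  # 0.8
print(recall(8, 4))     # ~0.667
```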

But both from the lecture and your linked article precision and recall are calculated like this:
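That is, the denominators become the lengths of the candidate and the reference rather than TP + FP and TP + FN. A minimal sketch of ROUGE-1 computed from clipped unigram overlap (the `rouge1` helper and the example sentences are my own, not the official `rouge-score` implementation):

```python
from collections import Counter

def rouge1(candidate: str, reference: str):
    """ROUGE-1 precision/recall/F1 from clipped unigram overlap."""
    cand = Counter(candidate.split())
    ref = Counter(reference.split())
    # Each candidate unigram counts at most as often as it appears
    # in the reference (clipping).
    overlap = sum(min(c, ref[w]) for w, c in cand.items())
    p = overlap / sum(cand.values())   # divide by candidate length
    r = overlap / sum(ref.values())    # divide by reference length
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

p, r, f1 = rouge1("the cat sat on the mat", "the cat is on the mat")
# Overlap is 5 of 6 unigrams on each side, so p = r = f1 = 5/6
```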

While BLEU is not too bad, I’m going to have to sit down and go over ROUGE a few more times to make sure I get it.

However, my question actually relates to this: I know the context is different, but how are these versions of recall and precision equivalent?

Or are they being used ‘in a different way’?

… It is at least not ‘obviously’ jumping out at me…

I think this is it: the principle behind the formula is the same, just the application is different!

OIC, so still:

Precision = the proportion of positive identifications that were actually correct

and

Recall = the proportion of actual positives that were correctly identified by the model
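Concretely: if we treat each unigram occurrence in the candidate as a ‘positive identification’ and each unigram occurrence in the reference as an ‘actual positive’, the ROUGE-1 formulas fall straight out of the classification ones. A toy sketch (the sentence pair is made up):

```python
from collections import Counter

cand = Counter("the cat sat on the mat".split())
ref = Counter("the cat is on the mat".split())

# TP = candidate unigrams that also occur in the reference (clipped counts)
tp = sum(min(c, ref[w]) for w, c in cand.items())
fp = sum(cand.values()) - tp   # candidate unigrams with no reference match
fn = sum(ref.values()) - tp    # reference unigrams the candidate missed

# The classification-style formulas...
precision = tp / (tp + fp)
recall = tp / (tp + fn)

# ...reduce to the ROUGE-style ones: the denominators are just the
# candidate length and the reference length, respectively.
assert precision == tp / sum(cand.values())
assert recall == tp / sum(ref.values())
```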

So, I guess ‘same folks, different strokes’ (i.e., formulae).