Are BLEU and ROUGE intrinsic or extrinsic evaluation measures?

BLEU and ROUGE are metrics for evaluating machine translation. So, when we use these metrics, we are evaluating an NMT model on a specific task, which sounds like extrinsic evaluation.

On the other hand, Course 2 exemplifies intrinsic evaluation only on language models, i.e. probability distributions over sequences of words, and NMT models don’t seem to be language models. So, does the intrinsic/extrinsic classification of evaluations even apply to NMT, BLEU, ROUGE, etc.?

Hey @tetamusha,
An intriguing question, indeed :thinking: I never really thought about this, since in the specialization, the classification of evaluation metrics is presented for the metrics used for “evaluating word embeddings”, so I thought the classification was restricted to those metrics only. BLEU and ROUGE, on the other hand, are metrics used for evaluating NMT systems.

Now, when I did a Google search to find out the difference between “intrinsic and extrinsic” metrics, without any further context, around half of the results mentioned them in the context of word embeddings (or language representations). And when I searched for “BLEU” and “ROUGE” individually, most of the resources mentioned neither “intrinsic” nor “extrinsic”.

However, in this reference regarding BLEU, I found something which might be of interest to us.

Because BLEU itself just computes word-based overlap with a gold-standard reference text, its use as an evaluation metric depends on an assumption that it correlates with and predicts the real-world utility of these systems, measured either extrinsically (e.g., by task performance) or by user satisfaction. From this perspective, it is similar to surrogate endpoints in clinical medicine, such as evaluating an AIDS medication by its impact on viral load rather than by explicitly assessing whether it leads to longer or higher-quality life.
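To make the “word-based overlap” idea above concrete, here is a minimal sketch of a toy BLEU-1 score (modified unigram precision with a brevity penalty). This is only an illustration under simplifying assumptions — real BLEU combines clipped n-gram precisions for n = 1..4 and assumes proper tokenization, neither of which is shown here:

```python
from collections import Counter
import math

def toy_bleu1(candidate: str, reference: str) -> float:
    """Toy BLEU-1: clipped unigram precision times a brevity penalty.
    (Real BLEU geometrically averages n-gram precisions up to n = 4.)"""
    cand = candidate.split()
    ref = reference.split()
    ref_counts = Counter(ref)
    # Clip each candidate word's count by its count in the reference,
    # so repeating a matching word cannot inflate the score.
    overlap = sum(min(c, ref_counts[w]) for w, c in Counter(cand).items())
    precision = overlap / len(cand)
    # Brevity penalty: punish candidates shorter than the reference.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * precision

# 5 of 6 candidate unigrams appear in the reference -> 5/6 ~ 0.83
print(toy_bleu1("the cat sat on the mat", "the cat is on the mat"))
```

Note that nothing in this computation looks at a downstream task — it is purely a surrogate comparison against the gold-standard text, which is exactly the point the quoted passage makes.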

Additionally, if we look up the informal definitions of intrinsic and extrinsic evaluation metrics, we will find something as follows:

  • Intrinsic Evaluation — Focuses on intermediary objectives (i.e. the performance of an NLP component on a defined subtask)
  • Extrinsic Evaluation — Focuses on the performance of the final objective (i.e. the performance of the component on the complete application)

I have borrowed these definitions from this blog. So, in the case of BLEU and ROUGE scores, there is no sub-task present, only the final objective, which is machine translation. However, unlike the other extrinsic evaluation metrics we know of (which don’t involve any ambiguity; for instance, if we evaluate a model on NER, all the words have well-defined entities), BLEU and ROUGE can only be relied upon if we have good human translations (of which there can be multiple for a given sentence), so they involve some level of ambiguity. In my opinion, then, both of these metrics lie somewhere along the middle ground.
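The ambiguity point above — several human translations can all be correct — is handled in BLEU by clipping each candidate word against the *maximum* count it attains across all references. A hedged sketch of that multi-reference clipping (unigram only, whitespace tokenization assumed; the sentences below are made up for illustration):

```python
from collections import Counter

def multiref_precision(candidate: str, references: list[str]) -> float:
    """Unigram precision where each candidate word is clipped by the
    highest count it reaches in ANY reference translation."""
    cand_counts = Counter(candidate.split())
    ref_counters = [Counter(r.split()) for r in references]
    clipped = sum(
        min(c, max(rc[w] for rc in ref_counters))
        for w, c in cand_counts.items()
    )
    return clipped / sum(cand_counts.values())

# Every candidate word is licensed by at least one of the two references,
# so the candidate is not penalized for matching only one phrasing.
refs = ["it is a guide to action", "it is the guiding principle"]
print(multiref_precision("it is a guide", refs))
```

So the metric tolerates multiple valid phrasings, but only to the extent that the reference set covers them — which is why good (and plural) human references matter so much.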

Let me tag in some other mentors so that we can get their perspectives as well on this. Hey @arvyzukai and @reinoudbosch, can you please let us know your takes on this? Thanks in advance.


Hi @tetamusha and @Elemento,

I found this paper which may provide some insights.

I have found two YouTube videos that describe intrinsic and extrinsic evaluation in the context of clustering algorithms. They seem to describe intrinsic evaluation as any kind of surrogate metric applied to the output of the model/algorithm, even when this metric compares the model output with ground-truth values. Their example of extrinsic evaluation is when the output of the model is used to solve an actual real-world problem.

In the context of NLP, I’ve found the same blog post as @Elemento, and all metrics (BLEU, ROUGE, F1 score, precision, recall, accuracy, etc.) are listed under intrinsic evaluation.

This leads me to believe that even comparing a model against a reference dataset (such as a test dataset) is considered to be intrinsic evaluation, unless the reference dataset directly reflects the real-world problem that the model is supposed to tackle.

For example, if we intend to apply an NMT model to news articles and compute its BLEU or ROUGE scores on a generic test dataset, we are performing intrinsic evaluation. But if we compute the same scores on news articles that have already been human-translated and which are a subset of the data on which we plan to apply the model in production, then it could be considered extrinsic evaluation.

Hey @tetamusha,

This seems like good intuition to me. However, I would like to add one thing, which may seem funny to you :joy: Ultimately, it won’t matter whether it’s “intrinsic” or “extrinsic” evaluation, because they are just “categories” and nothing else. If I say “I have evaluated this model intrinsically”, readers won’t be able to get anything from that unless the other details are mentioned. And if the other details are mentioned, then it doesn’t matter to the reader whether it’s an intrinsic or extrinsic evaluation, I believe.