Are BLEU and ROUGE intrinsic or extrinsic evaluation measures?

BLEU and ROUGE are metrics for evaluating machine translation. So, when we use these metrics, we are evaluating an NMT model on a specific task, which sounds like extrinsic evaluation.

On the other hand, Course 2 exemplifies intrinsic evaluation only on language models, i.e. probability distributions over sequences of words, and NMT models don’t seem to be language models. So, does the intrinsic/extrinsic classification of evaluations even apply to NMT, BLEU, ROUGE, etc.?

Hey @tetamusha,
An intriguing question, indeed :thinking: I never really thought about this, since in the specialization, the classification of evaluation metrics is presented for the metrics used for “evaluating word embeddings”, so I thought the classification was restricted to those metrics only. BLEU and ROUGE, on the other hand, are metrics used for evaluating NMT systems.

Now, when I did a Google search to find out the difference between “intrinsic and extrinsic” metrics, without any further context, around half of the results mentioned them in the context of word embeddings (or language representations). And when I searched for “BLEU” and “ROUGE” individually, most of the resources mentioned neither “intrinsic” nor “extrinsic”.

However, in this reference regarding BLEU, I found something which might be of interest to us.

Because BLEU itself just computes word-based overlap with a gold-standard reference text, its use as an evaluation metric depends on an assumption that it correlates with and predicts the real-world utility of these systems, measured either extrinsically (e.g., by task performance) or by user satisfaction. From this perspective, it is similar to surrogate endpoints in clinical medicine, such as evaluating an AIDS medication by its impact on viral load rather than by explicitly assessing whether it leads to longer or higher-quality life.
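To make the “word-based overlap” idea above concrete, here is a minimal sketch of a toy BLEU-1 score (modified unigram precision with a brevity penalty). This is only an illustration under simplifying assumptions — real BLEU combines clipped n-gram precisions for n = 1..4 and assumes proper tokenization, neither of which is shown here:

```python
from collections import Counter
import math

def toy_bleu1(candidate: str, reference: str) -> float:
    """Toy BLEU-1: clipped unigram precision times a brevity penalty.
    (Real BLEU geometrically averages n-gram precisions up to n = 4.)"""
    cand = candidate.split()
    ref = reference.split()
    ref_counts = Counter(ref)
    # Clip each candidate word's count by its count in the reference,
    # so repeating a matching word cannot inflate the score.
    overlap = sum(min(c, ref_counts[w]) for w, c in Counter(cand).items())
    precision = overlap / len(cand)
    # Brevity penalty: punish candidates shorter than the reference.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * precision

# 5 of 6 candidate unigrams appear in the reference -> 5/6 ~ 0.83
print(toy_bleu1("the cat sat on the mat", "the cat is on the mat"))
```

Note that nothing in this computation looks at a downstream task — it is purely a surrogate comparison against the gold-standard text, which is exactly the point the quoted passage makes.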

Additionally, if we look up the informal definitions of intrinsic and extrinsic evaluation metrics, we will find something as follows:

  • Intrinsic Evaluation — Focuses on intermediary objectives (i.e. the performance of an NLP component on a defined subtask)
  • Extrinsic Evaluation — Focuses on the performance of the final objective (i.e. the performance of the component on the complete application)

I have borrowed these definitions from this blog. So, in the case of BLEU and ROUGE scores, there is no sub-task present, only the final objective, which is machine translation. However, unlike the other extrinsic evaluation metrics we know of (which don’t involve any ambiguity; for instance, if we evaluate a model on NER, all the words have well-defined entities), BLEU and ROUGE can only be relied upon if we have good human translations (of which there can be multiple for a given sentence), so they involve some level of ambiguity. In my opinion, then, both of these metrics lie somewhere along the middle ground.
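The ambiguity point above — several human translations can all be correct — is handled in BLEU by clipping each candidate word against the *maximum* count it attains across all references. A hedged sketch of that multi-reference clipping (unigram only, whitespace tokenization assumed; the sentences below are made up for illustration):

```python
from collections import Counter

def multiref_precision(candidate: str, references: list[str]) -> float:
    """Unigram precision where each candidate word is clipped by the
    highest count it reaches in ANY reference translation."""
    cand_counts = Counter(candidate.split())
    ref_counters = [Counter(r.split()) for r in references]
    clipped = sum(
        min(c, max(rc[w] for rc in ref_counters))
        for w, c in cand_counts.items()
    )
    return clipped / sum(cand_counts.values())

# Every candidate word is licensed by at least one of the two references,
# so the candidate is not penalized for matching only one phrasing.
refs = ["it is a guide to action", "it is the guiding principle"]
print(multiref_precision("it is a guide", refs))
```

So the metric tolerates multiple valid phrasings, but only to the extent that the reference set covers them — which is why good (and plural) human references matter so much.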

Let me tag in some other mentors so that we can get their perspectives as well on this. Hey @arvyzukai and @reinoudbosch, can you please let us know your takes on this? Thanks in advance.


Hi @tetamusha and @Elemento,

I found this paper which may provide some insights.

I have found two YouTube videos that describe intrinsic and extrinsic evaluation in the context of clustering algorithms. They seem to describe intrinsic evaluation as any kind of surrogate metric applied to the output of the model/algorithm, even when this metric compares the model output with ground-truth values. Their example of extrinsic evaluation is when the output of the model is used to solve an actual real-world problem.

In the context of NLP, I’ve found the same blog post as @Elemento, and all metrics (BLEU, ROUGE, F1 score, precision, recall, accuracy, etc.) are listed under intrinsic evaluation.

This leads me to believe that even comparing a model against a reference dataset (such as a test dataset) is considered to be intrinsic evaluation, unless the reference dataset directly reflects the real-world problem that the model is supposed to tackle.

For example, if we intend to apply an NMT model to news articles and compute its BLEU or ROUGE scores on a generic test dataset, we are performing intrinsic evaluation. But if we compute the same scores on news articles that have already been human-translated and which are a subset of the data on which we plan to apply the model in production, then it could be considered extrinsic evaluation.

Hey @tetamusha,

This seems like good intuition to me. However, I would like to add one thing, which may seem funny to you :joy: Ultimately, it won’t matter whether it’s “intrinsic” or “extrinsic” evaluation, because they are just “categories” and nothing else. If I say “I have evaluated this model intrinsically”, readers won’t be able to get anything from that unless the other details are mentioned. And if the other details are mentioned, then it doesn’t matter to the reader whether it’s an intrinsic or extrinsic evaluation, I believe.