What is the major difference between the BLEU score and the modified BLEU score?

Hello @arvyzukai

I am sorry, but as I was going through the BLEU video, I didn’t see much difference between the modified BLEU score and the plain BLEU score, or why the modified one matters. Can you explain this concept with a better example than the one the instructor used?

Also, the instructor mentions that BLEU does not do a good job of capturing semantic meaning and sentence structure, so why do we still use it, or do we even use it at all? I came across a query where a learner had used this same BLEU score, which now makes me wonder: is it really a significant NLP tool?

I also came across new terminology in relation to the BLEU score in the ungraded lab: it depends on the Brevity Penalty and Precision. Can you give some insight on this, as the formula in the ungraded lab seemed complex?

Thank you in advance.



Hi @Deepti_Prasad

Could you provide more details about what specifically you are referring to? I can’t quite recall the issue off the top of my head.

Yes, BLEU is just an “overlap” measure - how closely your model’s output matches the test set references.
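
For reference, this is how the score is usually written in the original BLEU paper. The notation in the ungraded lab may differ slightly, so treat the symbols here as my assumption rather than the lab's:

```latex
\text{BLEU} = \text{BP} \cdot \exp\left( \sum_{n=1}^{N} w_n \log p_n \right),
\qquad
\text{BP} =
\begin{cases}
1 & \text{if } c > r \\
e^{\,1 - r/c} & \text{if } c \le r
\end{cases}
```

Here p_n is the modified n-gram precision, w_n are weights (commonly 1/N with N = 4), c is the length of the candidate and r is the reference length. The brevity penalty BP only kicks in when the candidate is shorter than the reference, because precision alone would reward very short outputs.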

When we have two or more models and we want to compare them quickly over a handful of samples, we usually do that with some benchmark (BLEU is one of them).
Benchmarks are a controversial topic (not only in NLP but in other domains too, like vision and pretty much all others). Even evaluating models on human feedback is controversial. For example, translating from an Indian language to English is very region (and also time) dependent: USA vs. English vs. Scottish vs. International. Also, the “understanding” of the source language, I’m pretty sure, is different in different parts of India.
In other words, a perfect translation is not achievable (it’s a subjective matter), and benchmarks can be a “signal” of which models are doing better than others.

I found this simple example (which I believe was a link in the Course?) to be helpful. It explains the terms in simple language and also provides a concrete example.

Just a reminder:

While discussing benchmarks, it is important not to forget the overall picture: in the end, what matters most is your “goal” (money, time or however you measure success). To achieve your goals, you might use models, which in the case of ML/NLP are heavily dependent on the data. If the data are worthless, no model evaluation (benchmark) will matter. So spending your time accordingly is important. In other words:

  • your goal > train/val/test sets > evaluation scores

I’m pretty sure you knew that and the original question is not about that, but it never hurts to remind ourselves. As Goodhart’s law states:
“When a measure becomes a target, it ceases to be a good measure”



So you are stating BLEU OR ROUGE ARE ALL only model analysis tools, based on performance, to check how well the model did???

I did check the later videos, which mention that the F1 score is calculated using the BLEU (precision) and ROUGE (recall) scores.

I do remember the evaluation metrics you are referring to, but I wanted to understand whether BLEU has any significance beyond what you mentioned.

Also, the BLEU score being a benchmark for this again sounds questionable to me, as it relies on references that come from humans, or vice versa. That being said, I am not saying human feedback is right or wrong, but checking a model's performance against human output can cut both ways if I look at the overall picture of scoring, performance and evaluation here.

Basically BLEU OR ROUGE are a type of RLHF!!!

But I still didn’t understand the difference between the BLEU score and the modified BLEU score shown in the video. It is in the first week of Course 4.


:slight_smile: I’m not sure why you are surprised, Deepti. Yes, BLEU and ROUGE are “traditional” evaluation metrics. The quality of generated text has become so good that we need other metrics, like G-Eval (which has its own flaws too).

Did you misunderstand them to be loss/cost functions? The loss function (which directly tells us how to update the model’s weights) for all base language models that I know of is cross entropy (which is related to perplexity). In other words, we update models based on the probabilities they assign to the correct tokens (and not on any “evaluation” metric).
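
To make the cross entropy / perplexity relation concrete, here is a minimal sketch. The function names and the probabilities are made up for illustration and are not from the course; the only claim is that perplexity is the exponential of the average cross entropy.

```python
import math


def cross_entropy(token_probs):
    """Average negative log-likelihood (in nats) of the correct tokens.

    token_probs: probabilities the model assigned to each correct token
    (hypothetical values, each in (0, 1]).
    """
    return -sum(math.log(p) for p in token_probs) / len(token_probs)


def perplexity(token_probs):
    """Perplexity is exp(cross entropy) when the log is the natural log."""
    return math.exp(cross_entropy(token_probs))


# A model that gives every correct token probability 0.25 is, on average,
# as uncertain as a uniform choice among 4 options.
probs = [0.25, 0.25, 0.25, 0.25]
ce = cross_entropy(probs)
ppl = perplexity(probs)
print(f"cross entropy = {ce:.4f} nats, perplexity = {ppl:.4f}")
```

Note that nothing here looks at n-gram overlap: this is what training optimizes, while BLEU is only computed afterwards on the generated text.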

I’m not sure what you mean by that, but almost every dataset is “output” from humans (especially in NLP).

No, RLHF stands for “Reinforcement Learning from Human Feedback”. It is used to “align” a model’s outputs with human (whoever they are) preferences. In simple terms, the “RL part” tries to learn a policy (what it is that we humans actually prefer), and according to that policy the model has to adjust its probabilities for the upcoming token. In other words, it directly influences the loss function (the model has to assign high probability not only to the most probable token but also to the one we prefer).

The modified version simply does not allow the same word to be counted twice (or, more precisely, more times than it appears in the reference). Compare the 2:17 and 3:45 marks in the video.
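
Here is a small sketch of that difference for unigrams only. The sentence pair is a toy example of my own choosing (not necessarily the one from the video), and the function names are hypothetical:

```python
from collections import Counter


def unigram_precision(candidate, reference):
    """Plain precision: every candidate word that occurs anywhere in the
    reference counts as a match, no matter how often it is repeated."""
    if not candidate:
        return 0.0
    ref_words = set(reference)
    matches = sum(1 for word in candidate if word in ref_words)
    return matches / len(candidate)


def modified_unigram_precision(candidate, reference):
    """Modified (clipped) precision: a word is credited at most as many
    times as it appears in the reference."""
    if not candidate:
        return 0.0
    cand_counts = Counter(candidate)
    ref_counts = Counter(reference)
    clipped = sum(min(count, ref_counts[word])
                  for word, count in cand_counts.items())
    return clipped / len(candidate)


candidate = "the the the the the the the".split()
reference = "the cat is on the mat".split()

plain = unigram_precision(candidate, reference)              # 7 of 7 words match
modified = modified_unigram_precision(candidate, reference)  # "the" clipped to 2
print(f"plain = {plain:.3f}, modified = {modified:.3f}")
```

A degenerate output that only repeats "the" gets a perfect plain precision, while the clipped version credits "the" only twice, because that is how often it occurs in the reference. This sketch uses a single reference; with several references the clip is usually taken as the maximum count over all of them.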


P.S. I probably won’t be able to respond for about a week (away on vacation)


I am not surprised, I am only asking questions. I knew about BLEU superficially, but some of the terms have left me confused rather than surprised. I will try to find out myself. Thank you.

When did I say that I confused it with a loss/cost function or anything else? All I asked was whether BLEU holds any significance and to what extent, and how the modified BLEU score differs from the BLEU score.

Anyway, thank you for your time and have a great vacation.



That’s brilliant and it was totally new to me. Thanks for the link! For anyone else who sees this, the Wikipedia page is definitely worth a look.


Understood :+1: I interpreted the caps and three question marks as a surprise reaction. That is the only reason I mentioned it, because in written communication it is not always clear which part of a question should be addressed.

The answer on the significance of the BLEU score is somewhat subjective, but nowadays for big LMs it is largely irrelevant (it might be reported as a metric, but definitely not as something important). On the other hand, depending on your task, it could still be useful: for example, with “small” models, when the expected tokens are pretty clear, or in tasks like Named Entity Recognition, where an overlap score might be informative.

Thank you!

