Bug: C4_W1_Ungraded_Lab_3_Bleu_Score Incorrectly Implemented

The intent of the lab is to illustrate a manual calculation of a BLEU score and verify it against a result from the "sacrebleu" library.
The sacrebleu library reports 0.0 for both of the tests illustrated.
0.0 and 0.0 do not compare well with the 27.6 and 35.3 calculated in Steps 1-4 of the lab.
The lab has failed in its basic premise.

The sacrebleu library has a number of defaults (its own tokenization and case handling, a default maximum n-gram length) and argument expectations (references passed as a list of lists) that could lead to the 0 scores with the lab inputs.
It looks like the lab code that calls sacrebleu is incomplete and cannot produce a reasonable result.
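
For concreteness, here is a minimal sketch of the argument shapes the sacrebleu documentation describes for corpus_bleu (the toy sentences are mine, not the lab's): the hypotheses are a list of strings, and the references are a list of reference streams, i.e. a list of lists of strings.

import sacrebleu

# Hypotheses: one string per segment.
hypotheses = ["The cat sat on the mat."]
# References: a list of reference streams; each stream holds one reference
# string per segment, so a single-reference setup is a list containing one list.
references = [["The cat is on the mat."]]

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(round(bleu.score, 1))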

Hi @Gregory314159

Could you send me your lab code to check? I could not replicate your results (my scores are not 0.0, and they compare well with sacrebleu). Maybe you tinkered with the code somewhere?

The values are still 0:

print(
    "Results reference versus candidate 1 our own BLEU implementation: ",
    round(bleu_score(tokenized_corpus_cand, tokenized_corpus_ref) * 100, 1),
)

Results reference versus candidate 1 our own BLEU implementation: 43.6

print(
    "Results reference versus candidate 1 sacrebleu library BLEU: ",
    round(sacrebleu.corpus_bleu(wmt19_can_1, wmt19_ref_1).score, 1),
)

Results reference versus candidate 1 sacrebleu library BLEU: 0.0
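
If wmt19_can_1 and wmt19_ref_1 are plain strings (an assumption on my part; the cell that builds them is not shown here), one quick check of the list-of-lists point raised above is to wrap them in the shapes corpus_bleu documents:

# Hypothetical check, assuming wmt19_can_1 and wmt19_ref_1 are single strings:
# pass a list of hypotheses and a list of reference streams (a list of lists).
print(
    round(sacrebleu.corpus_bleu([wmt19_can_1], [[wmt19_ref_1]]).score, 1)
)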


Hi @Thomas1

Can you send me your notebook for me to check?

C4_W1_Ungraded_Lab_3_Bleu_Score.ipynb (54.5 KB)
Sure

Hi @arvyzukai
It is not a user-side error; it is a bug in the lab itself.

The supplied lab does not correctly use the sacrebleu library.
The supplied lab does not indicate that the user should edit any of the erroneous cells.

This error has existed for some time as there are other posts describing it.

Interested readers can obviously look up the sacrebleu documentation, but that is not indicated in the notebook in any way.

Again, the intent of the lab, expressed in the opening cells, is to illustrate that a manual example of calculating a BLEU score compares with the sacrebleu library result.
It does not compare well at all, due to errors on the authoring (not the student) side of the notebook.

Sacre bleu! What a mess.
GitHub - mjpost/sacrebleu: Reference BLEU implementation that auto-downloads test sets and reports a version string to facilitate cross-lab comparisons
NLTK BLEU score calculation: NLTK :: nltk.translate.bleu_score
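
For anyone who wants an independent cross-check against the nltk link above, here is a minimal sketch using NLTK's sentence-level scorer (the toy tokens are mine; note that NLTK reports BLEU on a 0-1 scale, so multiply by 100 to compare with sacrebleu):

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# NLTK expects pre-tokenized input: a list of reference token lists
# and a single hypothesis token list.
reference_tokens = "the cat is on the mat".split()
candidate_tokens = "the cat sat on the mat".split()

score = sentence_bleu(
    [reference_tokens],
    candidate_tokens,
    smoothing_function=SmoothingFunction().method1,  # avoid zero scores from missing higher-order n-grams
)
print(round(score * 100, 1))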

In Step 5, try the sentence-level scorer instead; sacrebleu.sentence_bleu takes the candidate as a single string and the references as a list of strings:

print(
    "Results reference versus candidate 1 sacrebleu library sentence BLEU: ",
    #round(sacrebleu.corpus_bleu(candidate_1, reference).score, 1),
    round(sacrebleu.sentence_bleu(candidate_1, [reference]).score, 1),
)
print(
    "Results reference versus candidate 2 sacrebleu library sentence BLEU: ",
    #round(sacrebleu.corpus_bleu(candidate_2, reference).score, 1),
    round(sacrebleu.sentence_bleu(candidate_2, [reference]).score, 1),
)

Results reference versus candidate 1 sacrebleu library sentence BLEU: 27.6
Results reference versus candidate 2 sacrebleu library sentence BLEU: 35.3

Hi @Gregory314159

Thank you for noting this bug. I submitted the issue and it will be fixed as soon as possible.

Thanks