C4W1 - UNQ_10 - How to debug choice between valid translations

When I ran the final unit test on UNQ10:

# test mbr_decode
w1_unittest.test_mbr_decode(target=mbr_decode, score_fn=average_overlap, similarity_fn=rouge1_similarity)

I got the following result:

Expected output does not match
 3  Tests passed
 1  Tests failed

So, I added various print statements to my code in order to better understand where the problem lay. The results I got were as follows:

sentence I eat soup.
n_samples 4
temperature 0.6
---> 0 -0.0003108978271484375 0.999689150496573 Ich iss Suppe.
---> 1 -0.0003108978271484375 0.999689150496573 Ich iss Suppe.
---> 2 -0.000225067138671875 0.9997749581870365 Ich esse Schweine.
---> 3 -0.000110626220703125 0.9998893798981516 Ich esse Suppe.

 3 -0.000110626220703125
translated_sentence Ich esse Suppe. 0.9998893798981516
Expected output does not match
sentence I am hungry
n_samples 4
temperature 0.6
---> 0 -1.2909164428710938 0.27501862870316235 Ich bin hungrig da
---> 1 -2.09808349609375e-05 0.9999790193851352 Ich bin hungrig.
---> 2 -2.09808349609375e-05 0.9999790193851352 Ich bin hungrig.
---> 3 -2.09808349609375e-05 0.9999790193851352 Ich bin hungrig.

 3 -2.09808349609375e-05
translated_sentence Ich bin hungrig. 0.9999790193851352
sentence Congratulations!
n_samples 4
temperature 0.6
---> 0 -3.814697265625e-06 0.9999961853100103 Herzlichen Glückwunsch!
---> 1 -3.814697265625e-06 0.9999961853100103 Herzlichen Glückwunsch!
---> 2 -3.814697265625e-05 0.9999618537549303 Ich gratuliere Ihnen!
---> 3 -3.814697265625e-06 0.9999961853100103 Herzlichen Glückwunsch!

 3 -3.814697265625e-06
translated_sentence Herzlichen Glückwunsch! 0.9999961853100103
sentence You have completed the assignment!
n_samples 4
temperature 0.6
---> 0 -0.000232696533203125 0.9997673305385353 Sie haben die Aufgabe erfüllt!
---> 1 -4.9591064453125e-05 0.9999504101651634 Sie haben die Abmeldung abgeschlossen!
---> 2 -2.47955322265625e-05 0.9999752047751801 Sie haben die Abtretung abgeschlossen!
---> 3 -2.47955322265625e-05 0.9999752047751801 Sie haben die Abtretung abgeschlossen!

 3 -2.47955322265625e-05
translated_sentence Sie haben die Abtretung abgeschlossen! 0.9999752047751801
 3  Tests passed
 1  Tests failed

The error seems to lie in choosing “Ich esse Suppe”, instead of (I assume) “Ich iss Suppe”.

However, when I type either sentence into Google Translate, I get the English translation “I eat soup”.

When I look at various web pages that describe the conjugation of the German verb “to eat” by Googling “German verb eat” (e.g. Essen German Conjugation | Study.com or Conjugation of essen (to eat) in German | coLanguage), I find that “esse” tends to go with “I” and “iss” (or “isst”) tends to go with “you” or “he/she/it”.

So now I’m left wondering two things: how to debug my code so that it favors one valid translation over another, and whether there is any source of stochasticity in NMT beyond the sampling from the log-softmax output. Or is there some other reason why my code chose ‘esse’ over ‘iss’?

If anyone (who can) wants to see my code, the Lab ID is mjhlxfqb. I realize I’m getting hung up on what is probably a minor point, but I have to admit it’s really bugging me.

Hi @Steven1,

To calculate the scores in your Ex 10, you are using the weighted average overlap. If you look at the function parameters for Ex 10, though, you’ll notice that a scoring function is already being passed in (score_fn, which the unit test sets to average_overlap). Use that function instead.
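To illustrate the idea of calling whatever scorer is passed in rather than hard-coding one, here is a minimal, self-contained sketch. It is not the assignment's code: `pick_best`, `mean_overlap` and `jaccard_similarity` are hypothetical stand-ins, and the argument order `(similarity_fn, samples, log_probs)` is an assumption made for this example.

```python
def jaccard_similarity(a, b):
    """Hypothetical similarity: shared unique tokens over all unique tokens."""
    set_a, set_b = set(a), set(b)
    return len(set_a & set_b) / len(set_a | set_b)


def mean_overlap(similarity_fn, samples, *ignored):
    """Hypothetical scorer: mean similarity of each sample to all the others.

    Extra positional arguments (such as log probabilities) are accepted and
    ignored so that weighted and unweighted scorers share one call shape.
    """
    scores = {}
    for i, candidate in enumerate(samples):
        others = [similarity_fn(candidate, s)
                  for j, s in enumerate(samples) if j != i]
        scores[i] = sum(others) / len(others)
    return scores


def pick_best(samples, log_probs, score_fn, similarity_fn):
    """Score the samples with the scorer supplied by the caller."""
    # The scorer is a parameter, so the caller (or a unit test) decides
    # whether a weighted or an unweighted overlap is used.
    scores = score_fn(similarity_fn, samples, log_probs)
    best_index = max(scores, key=scores.get)
    return samples[best_index], scores


# Toy token lists and log probabilities, invented for this example.
samples = [["a", "b", "c"], ["a", "b", "c"], ["a", "b", "d"]]
log_probs = [-0.1, -0.1, -0.5]

best, scores = pick_best(samples, log_probs, mean_overlap, jaccard_similarity)
print(best, scores)
```

Because the scorer is an argument, swapping in a weighted variant only changes the call site, not the body of `pick_best`.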


P.S. I have removed all of the print statements from your Ex 10, as they would cause grading issues. If you added extra print statements elsewhere in the notebook, be sure to remove them before submitting for grading; otherwise the autograder may report confusing errors.

Hi @Steven1

Nice catch! In correct German, the translation should indeed be “Ich esse Suppe”. I was wondering why your scores are so high (if I am interpreting your output correctly). For this particular sentence, the test case should produce:

('Ich iss Suppe.',
{0: 0.8571428571428572, 1: 0.8571428571428572, 2: 0.7619047619047619, 3: 0.8571428571428571})
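
As a rough sketch of how a winner could be read off a score dictionary like this one, assuming the selection is a plain argmax over the values (the assignment's actual selection code is not shown in this thread), the candidate order below is copied from the printed output earlier in the thread:

```python
# Scores copied from the expected output above.
expected_scores = {0: 0.8571428571428572, 1: 0.8571428571428572,
                   2: 0.7619047619047619, 3: 0.8571428571428571}

# Candidate translations in the order they were printed earlier in the thread.
samples = ["Ich iss Suppe.", "Ich iss Suppe.",
           "Ich esse Schweine.", "Ich esse Suppe."]

# Assumption: the best candidate is the key with the largest score.
# max() returns the first key that attains the maximum value.
best_index = max(expected_scores, key=expected_scores.get)
print(best_index, samples[best_index])
```

Under that assumption, index 0 wins only because its score is larger than that of index 3 in the last printed digit.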

Thanks!!! Now, the world makes sense again.

@arvyzukai - nope, the values I printed earlier were log probabilities and probabilities, not overlap scores. When I print the weighted and non-weighted average scores instead, I see this:

sentence: I eat soup.
non-weighted avgs: {0: 0.8571428571428572, 1: 0.8571428571428572, 2: 0.7619047619047619, 3: 0.8571428571428571}
weighted avgs: {0: 0.8571387701816032, 1: 0.8571387701816032, 2: 0.7619111199457392, 3: 0.857142857142857}

For the non-weighted averages, the “correct” sentence loses by about 1e-16, which I believe qualifies as numerical noise (the values actually make more sense as fractions: 0.857142… = 6/7). For such a short sentence, I should probably try calculating the scores by hand…
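
A quick way to do that hand calculation is with exact fractions. The sketch below is only an illustration: the pairwise similarity values are not printed anywhere in this thread, so the ones used here are an assumption, obtained by solving the three reported non-weighted averages for the three unknown pair values.

```python
from fractions import Fraction

# Assumed pairwise similarities (inferred from the reported averages,
# not measured with the assignment's similarity function).
same = Fraction(1)               # sample 0 vs sample 1: identical strings
iss_vs_esse = Fraction(6, 7)     # "Ich iss Suppe." vs "Ich esse Suppe."
iss_vs_schw = Fraction(5, 7)     # "Ich iss Suppe." vs "Ich esse Schweine."
esse_vs_schw = Fraction(6, 7)    # "Ich esse Suppe." vs "Ich esse Schweine."

# Non-weighted average overlap: mean similarity to the other three samples.
avg_0 = (same + iss_vs_schw + iss_vs_esse) / 3            # "Ich iss Suppe."
avg_2 = (iss_vs_schw + iss_vs_schw + esse_vs_schw) / 3    # "Ich esse Schweine."
avg_3 = (iss_vs_esse + iss_vs_esse + esse_vs_schw) / 3    # "Ich esse Suppe."

print(avg_0, avg_2, avg_3)  # 6/7 16/21 6/7

# Gap between the two floating-point scores reported above.
float_gap = 0.8571428571428572 - 0.8571428571428571
print(float_gap)
```

With these assumed values, samples 0 and 3 tie exactly at 6/7 in rational arithmetic, so the difference in the last digit of the floating-point scores would come from rounding rather than from the translations themselves.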

As a side question, I’m wondering how NMT does with colloquialisms. For instance, my grandparents were German emigres, so I know (or strongly believe) that a German would express hunger as “Ich habe Hunger” (i.e. “I have a hunger”, not “I am hungry”). Am I correct in believing that these are the sort of (culturally dependent?) translations that would give NMT trouble?

I’m not an expert on NMT, but as far as I know, not really. It of course depends on the dataset the model was trained on (and on the model architecture as well), but this particular example should usually be captured easily, because the meaning follows directly from the parts (“I have hunger” is not that far from “I am hungry”). Such phrases might not feel intuitive to English speakers, but “statistically” I think they are not that hard to grasp.

Idioms and metaphors, on the other hand, are more problematic for NMT, e.g. “my job is a jail” as a metaphor, or “spill the beans” as an idiom (meaning to reveal secret information unintentionally or indiscreetly), because here the meanings do not follow from the parts.