C4W2 Exercise 2 DecoderLayer: output is correct, error in the unit test

Hello, I have an error in exercise 2, DecoderLayer. The unit test reports two failures, even though the printed output of the exercise matches the expected output.

Output:
Using embedding_dim=12 and num_heads=16:
q has shape:(1, 15, 12)
Output of encoder has shape:(1, 7, 8)
Output of decoder layer has shape:(1, 15, 12)
Att Weights Block 1 has shape:(1, 16, 15, 15)
Att Weights Block 2 has shape:(1, 16, 15, 7)

Expected Output
Output:
Using embedding_dim=12 and num_heads=16:
q has shape:(1, 15, 12)
Output of encoder has shape:(1, 7, 8)
Output of decoder layer has shape:(1, 15, 12)
Att Weights Block 1 has shape:(1, 16, 15, 15)
Att Weights Block 2 has shape:(1, 16, 15, 7)

UnitTest:
Failed test case: Wrong values in 'out'.
Expected: [1.1810006, -1.5600019, 0.41289005, -0.03388882]
Got: [-0.61175704 -0.9107513 -0.14352934 1.6660377 ]

Failed test case: Wrong values in 'out' when we mask the last word. Are you passing the padding_mask to the inner functions?
Expected: [1.1297308, -1.6106694, 0.32352272, 0.15741566]
Got: [-0.5599833  -1.0828896   0.05846525  1.5844076 ]

Any suggestions as to why this error might be occurring?

In advance, thank you very much.

Hi @LevValenzuela

The Expected output shows only the dimensions of the output (in that regard your implementation is correct).

What the unit test tells you is that the actual values are not the same as expected (in both cases): one test checks the values without padding, the other with padding. In other words, you passed the unit tests for data types and output shapes, but not for the actual final output values.

You should very carefully check the code hints and see if your implementation does what you’re asked. Check whether you’re using self.mha1 vs. self.mha2 (and what arguments each receives), whether you apply the layer norms (1, 2, 3) where appropriate, etc.
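To make the two-attention-block structure concrete, here is a minimal sketch using Keras' built-in MultiHeadAttention as a stand-in for the assignment's own layers (the variable names, mask shapes, and the Keras mask convention of 1 = attend are illustrative assumptions, not the assignment's exact API):

```python
import tensorflow as tf

embedding_dim, num_heads = 12, 16

# Stand-in layers; the assignment defines its own versions of these
mha1 = tf.keras.layers.MultiHeadAttention(num_heads=num_heads, key_dim=embedding_dim)
mha2 = tf.keras.layers.MultiHeadAttention(num_heads=num_heads, key_dim=embedding_dim)

x = tf.random.normal((1, 15, embedding_dim))  # decoder input (target sequence)
enc_output = tf.random.normal((1, 7, 8))      # encoder output, as in the thread

# Block 1: self-attention over the target sequence with the look-ahead
# (causal) mask. Keras convention: 1 = attend, 0 = mask out.
look_ahead_mask = tf.linalg.band_part(tf.ones((1, 15, 15)), -1, 0)
mult_attn_out1, attn_weights_block1 = mha1(
    query=x, value=x, key=x,
    attention_mask=look_ahead_mask,
    return_attention_scores=True)

# Block 2: cross-attention. Queries come from the decoder side;
# keys/values are the *entire* enc_output, masked by the padding mask.
padding_mask = tf.ones((1, 1, 7))  # broadcasts over all 15 query positions
mult_attn_out2, attn_weights_block2 = mha2(
    query=mult_attn_out1, value=enc_output, key=enc_output,
    attention_mask=padding_mask,
    return_attention_scores=True)

print(attn_weights_block1.shape)  # (1, 16, 15, 15)
print(attn_weights_block2.shape)  # (1, 16, 15, 7)
```

Note that matching shapes alone (as in the sketch) does not guarantee the values are right; forgetting to pass either mask still produces correctly shaped tensors, which is exactly why the shape checks pass while the value checks fail.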

Let us know if you find your mistake.
Cheers

Hi @arvyzukai

In the first multi-head attention layer (mha1), I pass the input tensor x as query, key, and value, together with the look-ahead mask, and return the scores.

In the second multi-head attention layer (mha2), I pass Q, enc_output, enc_output, together with the padding mask, and return the scores.

The normalization and attention weights appear to be correct.

I’m uncertain whether, for enc_output, I should pass the entire tensor or only a specific part of it.

Thank you.

Hi @LevValenzuela

As far as I understand, you’re doing everything correctly.

It’s the entire enc_output (the padding_mask is what masks out the specific parts of it).

Ok… another point of failure could be forgetting or misapplying the skip/residual connections (before normalization). For example, the input to the first layernorm should be the sum of mult_attn_out1 and x; to the second, the sum of mult_attn_out2 and Q1; and to the third, the sum of ffn_output and mult_attn_out2.
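The three residual-then-normalize steps above can be sketched like this (a minimal illustration with Keras layers; the variable names follow the post, the hidden width and random stand-in tensors are assumptions, not the assignment's exact code):

```python
import tensorflow as tf

embedding_dim = 12
layernorm1 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
layernorm2 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
layernorm3 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
ffn = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation="relu"),  # hidden width is illustrative
    tf.keras.layers.Dense(embedding_dim),
])

x = tf.random.normal((1, 15, embedding_dim))               # decoder-layer input
mult_attn_out1 = tf.random.normal((1, 15, embedding_dim))  # stand-in for mha1 output

# Residual 1: add the layer input x, *then* normalize
Q1 = layernorm1(mult_attn_out1 + x)

mult_attn_out2 = tf.random.normal((1, 15, embedding_dim))  # stand-in for mha2 output

# Residual 2: add Q1 (the first block's normalized output), not x
mult_attn_out2 = layernorm2(mult_attn_out2 + Q1)

# Residual 3: add the second block's output to the feed-forward output
ffn_output = ffn(mult_attn_out2)
out = layernorm3(ffn_output + mult_attn_out2)

print(out.shape)  # (1, 15, 12)
```

Mixing up which tensor is added in each residual (e.g. adding x instead of Q1 at layernorm2) still yields the correct output shape, so only the value checks in the unit test catch it.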

Is this the way you applied normalization?

Thank you, @arvyzukai. The issue occurred in layernorm2, specifically with the output from the first block.