In Exercise 3 (Decoder) I am getting the following errors:
Failed test case: Wrong values in x.
Expected: [1.6461557, -0.7657816, -0.04255769, -0.8378165]
Got: [ 1.5847092 -0.22151496 -0.17638591 -1.1868083 ]
Failed test case: Wrong values in att_weights[decoder_layer1_block1_self_att].
Expected: [0.51728565, 0.48271435, 0.0]
Got: [0.49889773 0.5011023 0. ]
Failed test case: Wrong values in outd when training=True.
Expected: [1.6286429, -0.7686589, 0.00983591, -0.86982]
Got: [ 1.5842563 -0.35723066 -0.06137132 -1.1656543 ]
Failed test case: Wrong values in outd when training=True and use padding mask.
Expected: [1.390952, 0.2794097, -0.2910638, -1.3792979]
Got: [ 1.1789213 0.6112976 -0.33207467 -1.4581443 ]
I have followed all the steps and hints given in the code block and have passed all the previous unit tests in the assignment, but I am unable to solve the issue. I have looked into similar posts and tried the solutions, but the issue still persists.
The code you shared from the Decoder graded cell is correct. My suspicion is the DecoderLayer call cell: for block 1, make sure you have used the look-ahead mask and set return_attention_scores to True.
You have added training=training to block 1 and block 2, which is not supposed to be there. The instructions mention that dropout is applied during training, so training=training is only passed when applying dropout to the ffn output.
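To make that concrete, here is a minimal sketch of how the two attention blocks and the ffn dropout could fit together. It assumes the Keras MultiHeadAttention API, and the layer names and constructor arguments (mha1, layernorm1, dropout_ffn, and so on) are illustrative, so match them to the ones in your own notebook rather than copying this verbatim.

```python
import tensorflow as tf

class DecoderLayer(tf.keras.layers.Layer):
    """Minimal sketch of a decoder layer; names and arguments are illustrative."""

    def __init__(self, embedding_dim, num_heads, fully_connected_dim, dropout_rate=0.1):
        super().__init__()
        self.mha1 = tf.keras.layers.MultiHeadAttention(num_heads, key_dim=embedding_dim)
        self.mha2 = tf.keras.layers.MultiHeadAttention(num_heads, key_dim=embedding_dim)
        self.ffn = tf.keras.Sequential([
            tf.keras.layers.Dense(fully_connected_dim, activation="relu"),
            tf.keras.layers.Dense(embedding_dim),
        ])
        self.layernorm1 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.layernorm2 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.layernorm3 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.dropout_ffn = tf.keras.layers.Dropout(dropout_rate)

    def call(self, x, enc_output, training, look_ahead_mask, padding_mask):
        # Block 1: self-attention with the look-ahead mask and
        # return_attention_scores=True -- no training=training here.
        attn1, attn_weights_block1 = self.mha1(
            x, x, x, look_ahead_mask, return_attention_scores=True)
        out1 = self.layernorm1(attn1 + x)

        # Block 2: cross-attention over the encoder output with the padding mask,
        # again without training=training.
        attn2, attn_weights_block2 = self.mha2(
            out1, enc_output, enc_output, padding_mask, return_attention_scores=True)
        out2 = self.layernorm2(attn2 + out1)

        # The ffn output is the only place that gets dropout with training=training.
        ffn_output = self.ffn(out2)
        ffn_output = self.dropout_ffn(ffn_output, training=training)
        out3 = self.layernorm3(ffn_output + out2)

        return out3, attn_weights_block1, attn_weights_block2
```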
I tried removing the training parameter from the blocks.
It didn't make a difference in the unit tests for the decoder.
I'm not sure what is going wrong with the code.
Then the next suspect would be the earlier graded cells. I would check scaled dot product attention, as your error points towards wrong values of x.
Kindly share screenshots of both the scaled dot product attention and the Encoder graded cells.
Errors in the scaled dot product attention graded cell:
For the step that multiplies q by k transposed: you are using the wrong Python function for the multiplication.
The additional hints section just before the graded cell mentions:
you may find tf.matmul useful for matrix multiplication (check how you can use the parameter transpose_b)
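As a stand-alone illustration of what that parameter does (the shapes here are invented for the example):

```python
import tensorflow as tf

q = tf.random.uniform((1, 3, 4))   # (batch, seq_len_q, depth)
k = tf.random.uniform((1, 5, 4))   # (batch, seq_len_k, depth)

# transpose_b=True transposes the last two axes of k before multiplying,
# so the product has shape (batch, seq_len_q, seq_len_k)
matmul_qk = tf.matmul(q, k, transpose_b=True)
print(matmul_qk.shape)  # (1, 3, 5)
```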
Next, to calculate dk, kindly use tf.shape rather than k.shape. Also, as you know, dk is the dimension of the keys, which is used to scale everything down so the softmax doesn't explode; that dimension is the last axis, so index it with [-1], not -2.
In the next code line, when calculating the scaled attention logits, the denominator should be tf.math.sqrt(dk) rather than dk**0.5, since the formula divides by the square root of dk.
When adding the mask to the scaled tensor, your code is close, but we have seen that leaving out the decimal point can change the scaled weights. The instructions say to multiply (1. - mask) by -1e9, whereas you multiplied (1 - mask). Make sure you multiply exactly the way the instructions before the graded cell describe.
For the softmax normalization, you do not need to pass any axis argument, since the last axis is the default; you only need to apply the right activation function, which you did. So remove axis=-1.
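Putting those corrections together, the attention computation could look roughly like the sketch below. The function name and signature follow the common TensorFlow tutorial convention and may not match the notebook exactly, so treat it as a reference rather than a drop-in answer.

```python
import tensorflow as tf

def scaled_dot_product_attention(q, k, v, mask=None):
    # Multiply q by k transposed: (..., seq_len_q, seq_len_k)
    matmul_qk = tf.matmul(q, k, transpose_b=True)

    # dk is the dimension of the keys: the last axis, read with tf.shape
    # (cast to float so the square root below works)
    dk = tf.cast(tf.shape(k)[-1], tf.float32)

    # Scale the logits by the square root of dk
    scaled_attention_logits = matmul_qk / tf.math.sqrt(dk)

    # Multiply (1. - mask) by -1e9 so masked positions become very large negatives
    if mask is not None:
        scaled_attention_logits += (1. - mask) * -1e9

    # Softmax normalizes over the last axis by default, so no axis argument is needed
    attention_weights = tf.nn.softmax(scaled_attention_logits)

    # Weighted sum of the values: (..., seq_len_q, depth_v)
    output = tf.matmul(attention_weights, v)
    return output, attention_weights
```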
Let me know how it goes after these corrections.
I apologize for causing you trouble.
I had followed your instructions and removed the training parameter from the attention blocks. However, when that did not change anything, I tried running it again with the training parameter in.
As per your response, I have commented out the parameter again; however, the result stays the same.
I have shared the screenshot with you. Please let me know if I am making any mistakes.
My apologies, this time I missed noticing a minor error. In the step that adds the positional encoding to the word embedding, you used
[:, seq_len, :] whereas it should be [:, :seq_len, :]
Check the instructions before the graded cell; they mention this. @hj320
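To make the difference concrete, here is a small stand-alone illustration of that slice; the tensor names and shapes are invented for the example.

```python
import tensorflow as tf

seq_len = 3
pos_encoding = tf.random.uniform((1, 10, 4))   # (1, maximum_position_encoding, embedding_dim)
x = tf.random.uniform((2, seq_len, 4))         # word embeddings: (batch, seq_len, embedding_dim)

# [:, :seq_len, :] keeps the first seq_len positions -> (1, 3, 4),
# which broadcasts correctly against x
x = x + pos_encoding[:, :seq_len, :]

# [:, seq_len, :] picks only the single row at index seq_len -> (1, 4),
# which still broadcasts but adds the wrong values
print(pos_encoding[:, :seq_len, :].shape)  # (1, 3, 4)
print(pos_encoding[:, seq_len, :].shape)   # (1, 4)
```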