Issues with your assignment, per the graded cells:
GRADED FUNCTION: scaled_dot_product_attention
1. The softmax is normalized on the last axis (seq_len_k) so that the scores add up to 1. Recall that you do not need to pass an explicit axis argument here; the softmax already defaults to the last axis.
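For reference, here is a minimal sketch of how that line fits into scaled_dot_product_attention. The variable names and the mask convention ((1 - mask) * -1e9) are assumptions and may differ slightly from your notebook; the point is only that the softmax call works without an explicit axis argument.

```python
import tensorflow as tf

def scaled_dot_product_attention(q, k, v, mask=None):
    # Attention logits: (..., seq_len_q, seq_len_k)
    matmul_qk = tf.matmul(q, k, transpose_b=True)

    # Scale by sqrt of the key dimension
    dk = tf.cast(tf.shape(k)[-1], tf.float32)
    scaled_attention_logits = matmul_qk / tf.math.sqrt(dk)

    if mask is not None:
        # Push masked positions towards -inf so they get ~0 weight.
        # (Mask convention is an assumption; check your notebook's helper.)
        scaled_attention_logits += (1.0 - mask) * -1e9

    # tf.keras.activations.softmax already normalizes over the last axis
    # (axis=-1 by default), which here is seq_len_k, so no axis argument
    # is needed.
    attention_weights = tf.keras.activations.softmax(scaled_attention_logits)

    output = tf.matmul(attention_weights, v)  # (..., seq_len_q, depth_v)
    return output, attention_weights
```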
GRADED FUNCTION: DecoderLayer
For Block 1, the instructions clearly state that dropout will be applied during training, so you do not need to pass training=training in the self-attention call that produces mult_attn_out1. The same issue applies to the call that produces mult_attn_out2. See the sketch below.
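Here is a minimal sketch of what the layer can look like, assuming the notebook's overall structure (the layer names, mask wiring, and feed-forward block are assumptions, not a copy of the official solution). Because the attention layers are built with their own dropout rate, Keras applies that dropout automatically in training mode, and only the explicit Dropout layer in the feed-forward block takes training=training:

```python
import tensorflow as tf

class DecoderLayer(tf.keras.layers.Layer):
    def __init__(self, embedding_dim, num_heads, fully_connected_dim,
                 dropout_rate=0.1, layernorm_eps=1e-6):
        super().__init__()
        # The MHA layers carry their own dropout rate, so their dropout is
        # handled internally during training.
        self.mha1 = tf.keras.layers.MultiHeadAttention(
            num_heads=num_heads, key_dim=embedding_dim, dropout=dropout_rate)
        self.mha2 = tf.keras.layers.MultiHeadAttention(
            num_heads=num_heads, key_dim=embedding_dim, dropout=dropout_rate)
        self.ffn = tf.keras.Sequential([
            tf.keras.layers.Dense(fully_connected_dim, activation='relu'),
            tf.keras.layers.Dense(embedding_dim)])
        self.layernorm1 = tf.keras.layers.LayerNormalization(epsilon=layernorm_eps)
        self.layernorm2 = tf.keras.layers.LayerNormalization(epsilon=layernorm_eps)
        self.layernorm3 = tf.keras.layers.LayerNormalization(epsilon=layernorm_eps)
        self.dropout_ffn = tf.keras.layers.Dropout(dropout_rate)

    def call(self, x, enc_output, training, look_ahead_mask, padding_mask):
        # BLOCK 1: self-attention. No training=training here; the layer's
        # built-in dropout already runs only in training mode.
        mult_attn_out1, attn_weights_block1 = self.mha1(
            x, x, x, look_ahead_mask, return_attention_scores=True)
        Q1 = self.layernorm1(mult_attn_out1 + x)

        # BLOCK 2: cross-attention over the encoder output. Same reasoning,
        # so no training=training on this call either.
        mult_attn_out2, attn_weights_block2 = self.mha2(
            Q1, enc_output, enc_output, padding_mask, return_attention_scores=True)
        mult_attn_out2 = self.layernorm2(mult_attn_out2 + Q1)

        # BLOCK 3: feed-forward network. The explicit Dropout layer is the
        # one place where training=training is passed.
        ffn_output = self.ffn(mult_attn_out2)
        ffn_output = self.dropout_ffn(ffn_output, training=training)
        out3 = self.layernorm3(ffn_output + mult_attn_out2)

        return out3, attn_weights_block1, attn_weights_block2
```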
GRADED FUNCTION: Decoder
No mistakes
GRADED FUNCTION: Transformer
No mistakes
GRADED FUNCTION: next_word
In next_word, on the line commented "Create a look-ahead mask for the output", you were supposed to call create_padding_mask, but you used create_look_ahead_mask. That line is the main reason behind your error.
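A minimal sketch of that mask-creation step follows, assuming the notebook's helper name create_padding_mask. The helper body and the wrapper function build_next_word_masks are placeholders of my own for illustration; the exact mask value/shape convention in your version may differ.

```python
import tensorflow as tf

def create_padding_mask(token_ids):
    # Placeholder for the notebook's helper: marks padding positions
    # (token id 0). Treat the exact convention here as an assumption.
    return tf.cast(tf.math.equal(token_ids, 0), tf.float32)[:, tf.newaxis, :]

def build_next_word_masks(encoder_input, output):
    # Padding mask over the encoder input (shown only for context).
    enc_padding_mask = create_padding_mask(encoder_input)
    # Line commented "Create a look-ahead mask for the output": despite the
    # wording, the expected call is create_padding_mask on the tokens decoded
    # so far, not create_look_ahead_mask.
    dec_padding_mask = create_padding_mask(output)
    return enc_padding_mask, dec_padding_mask
```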
Regards
DP