Please look at the architecture of the decoder layer.
You have manually edited the starter-code comment, and your version now describes additional dropout layers in your implementation. Dropout needs to be applied only once, i.e. to the output of the feed-forward network.
From starter code:
# BLOCK 1
# calculate self-attention and return attention scores as attn_weights_block1.
# Dropout will be applied during training (~1 line).
Yours:
# BLOCK 1
# calculate self-attention and return attention scores as attn_weights_block1 (~1 line)
# LINE OF CODE
# apply dropout layer on the attention output (~1 line)
# LINE OF CODE APPLYING DROPOUT
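For illustration only, here is a minimal sketch of what the starter comment for BLOCK 1 is asking for. It assumes the attention layer is a `tf.keras.layers.MultiHeadAttention` created with its own `dropout` argument; the sizes and the names `mha1`, `layernorm1`, `x` and `look_ahead_mask` are made up for this example and are not taken from your notebook. The point is that a single call covers it, because the layer handles its dropout internally when `training=True`, so no separate dropout layer is needed on the attention output.

```python
import tensorflow as tf

# Hypothetical sizes, chosen only for this illustration.
embedding_dim, num_heads, dropout_rate = 8, 2, 0.1

# The attention layer carries its own dropout rate.
mha1 = tf.keras.layers.MultiHeadAttention(
    num_heads=num_heads, key_dim=embedding_dim, dropout=dropout_rate)
layernorm1 = tf.keras.layers.LayerNormalization(epsilon=1e-6)

# Dummy input of shape (batch, target_seq_len, embedding_dim).
x = tf.random.uniform((1, 3, embedding_dim))
# Lower-triangular mask: each position attends only to itself and earlier ones.
look_ahead_mask = tf.linalg.band_part(tf.ones((1, 3, 3)), -1, 0)

# BLOCK 1: one call returns both the attention output and the attention scores.
# Passing training=True lets the layer apply its internal dropout.
attn_out1, attn_weights_block1 = mha1(
    x, x, x,
    attention_mask=look_ahead_mask,
    return_attention_scores=True,
    training=True)

# Residual connection followed by layer normalization; no extra Dropout layer here.
out1 = layernorm1(attn_out1 + x)
```

A standalone `tf.keras.layers.Dropout` would then appear only once in the layer, on the output of the feed-forward network.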
Please follow these steps to refresh your workspace if required, and change the code only where necessary. See also the section "Important Note on Submission to the AutoGrader" in the notebook.