C5 - W4 - Transformers Architecture, 3rd June 2021 version

This assignment was updated on June 3rd, exactly one day before I got to work on it.

I’m a skilled programmer but still learning the intricacies of Python and the APIs of packages such as NumPy, TensorFlow, etc. This particular assignment is the toughest across all five courses in this specialization. I had to read up on the TF and NumPy docs and scan through previous postings in this forum whenever I ran into issues.

I haven’t seen what this notebook looked like before, but judging by earlier comments in this forum, the updated version goes a long way toward resolving user frustrations.

One small issue: in the DecoderLayer exercise, the argument placeholders for self.mha1() and self.mha2() don’t look correct, i.e. there are four "None"s when there should be five, since we need arguments for Q, K, V, training, and attention_mask. Can you confirm?

Hi @rvh, according to the documentation for MultiHeadAttention, the call parameters are: query, value, key, attention_mask, return_attention_scores, and training. So there is only one mask to pass.
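
To make that concrete, here is a minimal standalone sketch (my own example, not code from the notebook) of calling tf.keras.layers.MultiHeadAttention with a single attention_mask:

    import tensorflow as tf

    batch_size, seq_len, d_model, num_heads = 2, 5, 8, 2

    mha = tf.keras.layers.MultiHeadAttention(num_heads=num_heads, key_dim=d_model)

    # Self-attention: query, value and key are all the same tensor
    x = tf.random.uniform((batch_size, seq_len, d_model))

    # Causal (look-ahead) mask: 1 = attend, 0 = masked, shape (batch, T, S)
    causal = tf.linalg.band_part(tf.ones((seq_len, seq_len)), -1, 0)
    look_ahead_mask = tf.tile(causal[tf.newaxis, :, :], [batch_size, 1, 1])

    attn_out, attn_weights = mha(
        query=x,
        value=x,
        key=x,
        attention_mask=look_ahead_mask,      # the single mask argument
        return_attention_scores=True,
        training=False,
    )
    print(attn_out.shape)      # (2, 5, 8)    -> (batch_size, seq_len, d_model)
    print(attn_weights.shape)  # (2, 2, 5, 5) -> (batch_size, num_heads, T, S)

Whether the arguments are passed positionally or by keyword, there is still only one attention_mask slot.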

Moreover, if you check the attention equation, there is only one mask in it.
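
For reference, the scaled dot-product attention from the original Transformer paper applies a single mask M inside the softmax:

    \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}} + M\right)V

where M puts a large negative value at every position that should be ignored, driving its softmax weight to effectively zero.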

In the exercise, the placeholder code for Block 1 is as follows:

    # BLOCK 1
    # calculate self-attention and return attention scores as attn_weights_block1 (~1 line)
    attn1, attn_weights_block1 = self.mha1(None, None, None, None, return_attention_scores=True)  # (batch_size, target_seq_len, d_model)

As you pointed out, the self.mha1() call takes six arguments: query, value, key, attention_mask, return_attention_scores, training, but there are only four "None"s in the placeholder code. Shouldn’t there be five instead, since you need to specify query, value, key, attention_mask and training?
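
To illustrate what I mean, I’d expect the placeholder to look something like this (a hypothetical version I wrote for illustration, not the actual notebook code):

    # BLOCK 1 (hypothetical placeholder with a fifth "None" for training)
    attn1, attn_weights_block1 = self.mha1(
        None, None, None, None,       # query, value, key, attention_mask
        return_attention_scores=True,
        training=None,                # fifth placeholder: the training flag
    )  # (batch_size, target_seq_len, d_model)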