Question about the Keras docs for tf.keras.layers.MultiHeadAttention

This concerns the documentation:

" This layer first projects query , key and value . These are (effectively) a list of tensors of length num_attention_heads , where the corresponding shapes are [batch_size, , key_dim], [batch_size, , key_dim], [batch_size, , value_dim]."

Shouldn't each of these shapes have three dimensions? The middle entry seems to be missing. I'm guessing the missing entries are "key_seq_len" and "val_seq_len", i.e. the maximum sentence length (in words) across the batch in an NLP context. I ran a quick sanity check, shown below.
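Here is the sketch (my own code, not from the docs or the assignment; the sizes and names like query_seq_len are made up for illustration). The attention-score shape suggests the per-head tensors do carry those sequence-length dimensions:

```python
import tensorflow as tf

# Toy sizes, chosen only so the printed shapes are easy to read.
batch_size, query_seq_len, key_seq_len, d_model = 2, 5, 7, 16

mha = tf.keras.layers.MultiHeadAttention(num_heads=4, key_dim=8, value_dim=8)

query = tf.random.normal((batch_size, query_seq_len, d_model))
value = tf.random.normal((batch_size, key_seq_len, d_model))

output, scores = mha(query, value, return_attention_scores=True)
print(output.shape)  # (2, 5, 16)   -> (batch_size, query_seq_len, d_model)
print(scores.shape)  # (2, 4, 5, 7) -> (batch_size, num_heads, query_seq_len, key_seq_len)
```

The attention scores come out as (batch_size, num_heads, query_seq_len, key_seq_len), which only makes sense if the projected per-head query/key/value tensors keep their sequence-length dimensions.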

This layer seems more complicated than average, and it would be nice if the docs took care to avoid omissions and typos, which add cognitive burden for those who are new to it.

Are you asking about the Keras documentation, or the instructions provided with this assignment?

The official Keras docs. While this isn't directly part of the assignment instructions, I thought it might be OK to ask since we are using this Keras class out of the box rather than building it from scratch. Thanks.

I recommend you file a comment with the Keras authors about their documentation.

Thanks. I will file a doc bug with the TF Keras team.

But do you think I am likely correct? This will also help me solidify my understanding of multi-head attention. I know this isn't strictly part of the assignment, but it is used in the context of the assignment.

I can’t answer; I’m not a Keras expert.

I believe we should file one. At least I’ve already filed an issue about that :slight_smile:

The TF Keras implementation uses a per-head mask with shape (batch_size, 1, sequence_length); a small illustration follows.
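For anyone curious, here is a rough sketch (my own code, not taken from the Keras source; the padding pattern is invented) of how a (batch_size, 1, key_seq_len) padding mask can be broadcast to the documented (batch_size, query_seq_len, key_seq_len) attention_mask shape and passed to the layer:

```python
import tensorflow as tf

batch_size, query_seq_len, key_seq_len, d_model = 2, 5, 7, 16
mha = tf.keras.layers.MultiHeadAttention(num_heads=4, key_dim=8)

query = tf.random.normal((batch_size, query_seq_len, d_model))
value = tf.random.normal((batch_size, key_seq_len, d_model))

# Pretend the last key/value positions of each sequence are padding (0 = masked).
key_not_padding = tf.cast(tf.constant([[1, 1, 1, 1, 1, 0, 0],
                                       [1, 1, 1, 1, 0, 0, 0]]), tf.bool)  # (2, 7)

# Per-sequence padding mask (batch_size, 1, key_seq_len), broadcast to the
# documented attention_mask shape (batch_size, query_seq_len, key_seq_len).
mask = tf.broadcast_to(key_not_padding[:, tf.newaxis, :],
                       (batch_size, query_seq_len, key_seq_len))

out, scores = mha(query, value, attention_mask=mask, return_attention_scores=True)
print(scores.shape)     # (2, 4, 5, 7)
print(scores[0, 0, 0])  # attention weights on the padded positions are ~0
```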