C4_W2_Assignment Multi-Head Attention input prep

In the assignment we use multi-head attention, but we never duplicate our inputs to match the number of heads. So:

Input size → (num_batch, num_seq, encoding_dim)

I expected the last dimension to be copied num_heads times in order to have multiple heads, so that the result would be (num_batch, num_seq, num_heads, encoding_dim).

But in the assignment implementation we use:

d_head = d_feature // n_heads

and then split d_feature into n_heads chunks of size d_head.

I don’t understand this approach and I am very confused. I would be really happy to hear what you think.

Hi @Halit_Boyar

That is a good question, and a lot of learners continue without stopping to ponder and understand this.

Each head “operates” on a different subpart of encoding_dim. So, for example, after the Embedding layer you have:
(2, 10, 32) # (num_batch, num_seq, encoding_dim)

Then, in the Encoder block, you have weights to form the Queries, Keys and Values. Here comes the part you are asking about: most of the time, and in this course, the first Encoder block is similar to the ones that come after it, so its output has to have the same size as its input. So, as you correctly noticed, it divides the 32-dimensional space among n_heads, and each head gets its own subpart (for example, if n_heads were 4, each head would “operate” on 8 of the 32 embedding/encoding dimensions), as the sketch below shows.
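Here is a minimal NumPy sketch of that split (the sizes and variable names are just illustrative, matching the example shapes above, not the assignment’s actual code):

```python
import numpy as np

num_batch, num_seq, d_feature = 2, 10, 32  # example sizes from above
n_heads = 4
d_head = d_feature // n_heads              # 32 // 4 = 8

x = np.random.randn(num_batch, num_seq, d_feature)

# Reshape so each head gets its own contiguous d_head-sized slice of the features,
# instead of duplicating the whole feature dimension per head.
x_heads = x.reshape(num_batch, num_seq, n_heads, d_head)

print(x_heads.shape)  # (2, 10, 4, 8)

# Head 1's slice is exactly features [8:16] of the original tensor.
assert np.allclose(x_heads[:, :, 1, :], x[:, :, 8:16])
```

Note that no new values are created: the same 32 features are just viewed as 4 groups of 8.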

You could have a duplicating layer to match the number of heads, or a first Encoder block whose weight matrices have a different size from the subsequent ones, and other architectures, but there is no need for that - there is nothing special about the Embedding layer. If the task you are training for requires that head1 and head2 receive the same embedding values, then the model should learn that (for example, embeddings[:8] would have the same values as embeddings[8:16]).

So most of the time W_q, W_k and W_v (the Queries, Keys and Values weight matrices) have the same shape (for example, (32, 32)), so that the output of the Encoder matches the input size; in this scenario each head gets its own subpart (for example, [0:8], [8:16], [16:24], [24:32]), as sketched below.
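A short sketch of the projection followed by the per-head slicing (again illustrative NumPy with made-up sizes, not the assignment’s actual implementation):

```python
import numpy as np

num_batch, num_seq, d_feature = 2, 10, 32
n_heads = 4
d_head = d_feature // n_heads

x = np.random.randn(num_batch, num_seq, d_feature)
W_q = np.random.randn(d_feature, d_feature)  # (32, 32): output size == input size

q = x @ W_q  # projected queries, still (2, 10, 32)

# Each head's queries are one contiguous slice of the projected features.
for i in range(n_heads):
    q_head = q[..., i * d_head:(i + 1) * d_head]
    print(f"head {i}: slice [{i * d_head}:{(i + 1) * d_head}], shape {q_head.shape}")
# head 0: slice [0:8],   shape (2, 10, 8)
# ...
# head 3: slice [24:32], shape (2, 10, 8)
```

The same slicing applies to the keys and values, which is why d_feature must be divisible by n_heads.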
