C4_W2_Assignment Multi-Head Attention input prep

In the assignment, we use multi-head attention but we never duplicate our inputs to match the number of heads. So,

Input size → num_batch, num_seq, encoding_dim

I expected the last dimension to be copied num_heads times in order to have multiple heads, so that the result would be: (num_batch, num_seq, num_heads, encoding_dim)

But, in the assignment implementation we use:

d_head = d_feature // n_heads

That is, instead of copying, we split d_feature into n_heads chunks of size d_head.

I don’t understand this approach and I am very confused. I would be really happy to hear what you think.

Hi @Halit_Boyar

That is a good question, and a lot of learners continue without stopping to ponder and understand this.

Each head “operates” on a different subpart of the encoding_dim. So, for example, after the Embedding layer you have:
(2, 10, 32) # (num_batch, num_seq, encoding_dim)

Then, in the Encoder block, you have weight matrices to form the Queries, Keys and Values. Here comes the part you are asking about: most of the time, and in this course, the first Encoder block is identical to the ones that come after it, so its output has to have the same size as its input. As you correctly noticed, it divides the 32-dimensional space among the n_heads, and each head gets its own subpart (for example, if n_heads were 4, each head would “operate” on 8 of the embedding/encoding dimensions).
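To make the split concrete, here is a minimal NumPy sketch (the shapes come from the example above; names like `x_heads` are mine, not from the assignment). Note that no values are duplicated, the feature dimension is just reshaped:

```python
import numpy as np

n_batch, n_seq, d_feature = 2, 10, 32  # sizes from the example above
n_heads = 4
d_head = d_feature // n_heads          # 32 // 4 = 8

x = np.random.rand(n_batch, n_seq, d_feature)

# Split the feature dimension into heads: no duplication, just a reshape.
x_heads = x.reshape(n_batch, n_seq, n_heads, d_head)
print(x_heads.shape)  # (2, 10, 4, 8)

# Head 1 "operates" on features [8:16] of the original tensor.
assert np.allclose(x_heads[:, :, 1, :], x[:, :, 8:16])
```

So the `(num_batch, num_seq, num_heads, encoding_dim)` shape you expected does appear, except the last axis is d_head = encoding_dim / num_heads, not the full encoding_dim.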

You could have a duplicating layer to match the number of heads, or a first Encoder block whose weight matrices differ in size from the subsequent ones, or other architectures, but there is no need for that: there is nothing special about the Embedding layer. If the task you are training for requires that head1 and head2 receive the same embedding values, then the model should be able to learn that (for example, embeddings[:8] would end up with the same values as embeddings[8:16]).

So, most of the time, W_q, W_k and W_v (the Queries, Keys and Values weight matrices) have the same square shape (for example, (32, 32)), so that the output of the Encoder matches the input size, and in this scenario each head gets its own subpart (for example, [0:8], [8:16], [16:24], [24:32]).
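A short NumPy sketch of that last point (again, just an illustration under the shapes above, not the assignment's code): one (32, 32) projection produces the queries for all heads at once, and each head then reads its own slice:

```python
import numpy as np

n_batch, n_seq, d_feature, n_heads = 2, 10, 32, 4
d_head = d_feature // n_heads  # 8

x = np.random.rand(n_batch, n_seq, d_feature)
W_q = np.random.rand(d_feature, d_feature)  # (32, 32): output size matches input size

Q = x @ W_q  # (2, 10, 32) - queries for all heads in one matrix product

# Each head takes its own subpart of the projected queries:
# head 0 -> [0:8], head 1 -> [8:16], head 2 -> [16:24], head 3 -> [24:32]
head_queries = [Q[..., h * d_head:(h + 1) * d_head] for h in range(n_heads)]
print([q.shape for q in head_queries])  # four arrays of shape (2, 10, 8)
```

The same slicing applies to the keys and values, which is why a single set of (32, 32) matrices is all that multi-head attention needs here.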