Multiheaded Attention - Number of heads and Dim of heads

In the Transformer Summarizer assignment, the dimension of each head is calculated by dividing the feature embedding dimension by the number of heads. In other words, each head is a part of the initial word embedding for q, k, v. Is this the right method, or is it a specific implementation adopted for this assignment?

In the Deep Learning Specialization (Course 5, Week 4), Andrew Ng explains multi-head attention as several equal and parallel computations, with separately learned q.W, k.W, v.W for each head, and NOT just a reshaping of the original embedding followed by concatenation.

In the multi-head implementation in the Transformer Summarizer assignment, the only benefit seems to be that the heads can be computed in parallel, as opposed to computing everything as a single head, but that is not the original idea behind multi-head attention.

To check this, I calculated multi-head attention for some sample q, k, v, first for the entire embedding (without d_head = d_feature / n_heads) and then for several heads with subsequent concatenation. I find the results are the same whether it is single head or multi-head (except for the differences that might arise from the linear layer over multiple iterations of the decoder block).

Kindly clarify whether d_head should be d_feature / n_heads, or whether each d_head should be the same as the dimension of the word embedding of q / x.

Hi @nmurugesh

Yes, this is true - this is how you calculate the dimension that each head operates on.

Not entirely true. Each head gets its own projection, or in other words, each head gets its own compressed/transformed version of the embeddings.

As in my previous point, I think you misinterpret the code. I recently answered a similar question which might help illustrate the point: even though a single operation produces Q, K and V, the underlying channels of information are separate. In other words, emb \cdot W_q is a single operation (for efficiency) that produces one output, but that output is later split up (isolated) across the different heads.
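The point about a single fused projection can be checked numerically. The following is a minimal NumPy sketch, not the assignment's code: the sizes, the random data and all variable names are assumptions chosen for illustration. It compares "one matmul, then split the output" with "one smaller matmul per head", where each head's weights are a column block of W_q.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 5, 8, 2         # assumed toy sizes
d_head = d_model // n_heads

emb = rng.normal(size=(seq_len, d_model))   # stand-in token embeddings
W_q = rng.normal(size=(d_model, d_model))   # one fused query projection

# Fused: a single matmul, then split the output columns per head.
q_fused = emb @ W_q                              # (seq_len, d_model)
q_split = np.split(q_fused, n_heads, axis=-1)    # n_heads arrays of (seq_len, d_head)

# Separate: each head has its own (d_model, d_head) weight matrix,
# taken here as a column block of W_q.
W_q_heads = np.split(W_q, n_heads, axis=-1)
q_separate = [emb @ W_i for W_i in W_q_heads]

heads_match = all(np.allclose(a, b) for a, b in zip(q_split, q_separate))
print(heads_match)  # True
```

Under these assumptions, splitting after the projection gives every head its own learned weights; the fused form only saves separate matmul calls.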

I would doubt those results: if training is done long enough on a large enough dataset, multi-head should be superior to single head. In other words, if your results are the same, there must be some underlying reason.

The head dimension is definitely d_feature / n_heads. To convince yourself, you can check the original paper and the PyTorch documentation:

In this work we employ h = 8 parallel attention layers, or heads. For each of these we use d_k = d_v = d_model / h = 64. Due to the reduced dimension of each head, the total computational cost is similar to that of single-head attention with full dimensionality.

  • num_heads – Number of parallel attention heads. Note that embed_dim will be split across num_heads (i.e. each head will have dimension embed_dim // num_heads).
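The arithmetic behind the quoted passages can be written out as a short sketch. The sequence length below is an assumed example value, and only the multiply-adds of the Q K^T score matrix are counted, so this is a rough illustration of why the cost stays similar rather than a full cost model.

```python
d_model, n_heads = 512, 8
seq_len = 1024                  # assumed example value, not from the paper

d_head = d_model // n_heads     # per-head dimension

# Multiply-adds needed to build the (seq_len x seq_len) score matrix.
single_head_cost = seq_len * seq_len * d_model
multi_head_cost = n_heads * (seq_len * seq_len * d_head)

print(d_head)                               # 64
print(single_head_cost == multi_head_cost)  # True
```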

Hi, Thanks for the reply.

The attention paper might be referring to d_head = embed_dim // num_heads as a means to reduce the computational costs.

The issue is not that d_head is smaller than d_model, but that each d_head is a partition of the columns of the d_model embeddings instead of being learned separately by the model.

I am getting the same result for both single head and multi-head because we are partitioning d_model vertically into several columns - the initial reshaping and subsequent concatenation do not in any way alter the contents of the embedding, as the dot product is done between partitioned columns.
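One way to test the "same result" observation is to run both variants on the same random q, k, v. This is a minimal NumPy sketch under assumed toy sizes, with hypothetical helper names (`softmax`, `attention`); it has no masking and no learned layers, so it only isolates the effect of splitting the feature axis into heads.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product attention without masking.
    scores = q @ k.swapaxes(-1, -2) / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 4, 8, 2         # assumed toy sizes
q, k, v = (rng.normal(size=(seq_len, d_model)) for _ in range(3))

single = attention(q, k, v)                 # one head over all d_model columns

# Multi-head: attend within each column slice, then concatenate the outputs.
q_heads, k_heads, v_heads = (np.split(m, n_heads, axis=-1) for m in (q, k, v))
multi = np.concatenate(
    [attention(a, b, c) for a, b, c in zip(q_heads, k_heads, v_heads)],
    axis=-1,
)

print(single.shape == multi.shape)          # True
print(np.allclose(single, multi))           # expected False for generic inputs
```

The shapes agree, but each head applies its own softmax over its own scores, so for generic inputs the values should differ from the single-head output.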

More specifically, the current code for causal attention in transformer summarizer is:

        tl.Branch( # creates three towers for one input, takes activations and creates queries keys and values
            [tl.Dense(d_feature), ComputeAttentionHeads], # queries
            [tl.Dense(d_feature), ComputeAttentionHeads], # keys
            [tl.Dense(d_feature), ComputeAttentionHeads], # values
        ),

In the above code, the result of the dense layer is split into multiple heads by partitioning vertically and then scaled dot product attention is taken.
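The split described above can be sketched in NumPy as follows. `split_heads` is a hypothetical stand-in written for illustration, not the assignment's `ComputeAttentionHeads` layer, and the input is a simple counter array rather than a real Dense output.

```python
import numpy as np

def split_heads(x, n_heads):
    """Reshape (batch, seq_len, d_feature) -> (batch * n_heads, seq_len, d_head)."""
    batch, seq_len, d_feature = x.shape
    d_head = d_feature // n_heads
    x = x.reshape(batch, seq_len, n_heads, d_head)
    x = x.transpose(0, 2, 1, 3)              # (batch, n_heads, seq_len, d_head)
    return x.reshape(batch * n_heads, seq_len, d_head)

# Stand-in for the output of a Dense(d_feature) layer with d_feature = 8.
x = np.arange(2 * 3 * 8, dtype=float).reshape(2, 3, 8)
heads = split_heads(x, n_heads=2)

print(heads.shape)                            # (4, 3, 4)
print(np.array_equal(heads[0], x[0, :, :4]))  # True: head 0 holds the first 4 columns
print(np.array_equal(heads[1], x[0, :, 4:]))  # True: head 1 holds the last 4 columns
```

The values are untouched; only the layout changes so that every head becomes its own batch entry for the dot-product attention step.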

Instead the d_heads can be learnt by a dense layer and then scaled dot product attention could be derived:

[Dense(d_heads) → Calculate scaled dot product attention for each head → concatenate the results ] for each q,k,v

I have verified that this is the way it is done in the transformer assignments of the Deep Learning Specialization and also in the TensorFlow tutorial “Neural machine translation with a Transformer and Keras” (Neural machine translation with a Transformer and Keras  |  Text  |  TensorFlow).

In the above tensorflow tutorial, the d_model is the parameter passed to the multihead attention layer of the encoder and decoder layers. The tensorflow tutorial code for your reference:

class Encoder(tf.keras.layers.Layer):
  def __init__(self, *, num_layers, d_model, num_heads,
               dff, vocab_size, dropout_rate=0.1):
    super().__init__()

    self.d_model = d_model
    self.num_layers = num_layers

    self.pos_embedding = PositionalEmbedding(
        vocab_size=vocab_size, d_model=d_model)

    # d_model is passed on to every encoder layer (and its attention layer).
    self.enc_layers = [
        EncoderLayer(d_model=d_model,
                     num_heads=num_heads,
                     dff=dff,
                     dropout_rate=dropout_rate)
        for _ in range(num_layers)]
    self.dropout = tf.keras.layers.Dropout(dropout_rate)

Hey @nmurugesh and @arvyzukai,
I spent more than an hour understanding your takes on this, since both seemed to be correct to me :joy: and I guess, both of them are indeed correct as well. What Prof Andrew mentioned in DLS C5 W4 is correct, and what is mentioned in NLP C4 W2 is also correct. It’s just the use of notation that is causing the confusion. I will be referring to one DLS thread here.

I believe that PyTorch and TensorFlow use different notation for the same thing. I have borrowed the image below from the reference.


As you can see in W^Q that there is indeed a split in embedding_dim for the n_heads along the horizontal direction, but there is no split in the vertical direction. So, what @arvyzukai has been mentioning all along is the split along the horizontal direction, and what @nmurugesh has been mentioning all along is the no-split across the vertical direction.

In DLS, Prof Andrew took the matrices for the different heads as W_i^Q, W_i^K, W_i^V, i.e., there were 3 * n_heads matrices each of dimension (embedding_dim // n_heads, embedding_dim). In other words, there was no split, since the matrices taken were already in their correct shapes.

But in NLP, we have taken them as 3 matrices only, each of dimension (embedding_dim, embedding_dim), and for axis = 0, we are considering a logical split so that the first dimension is split into embedding_dim // n_heads.

In both cases, as you can see, after performing concatenation we get 3 matrices of dimensions (embedding_dim, embedding_dim). Let me know what you guys think about this.


Hi @nmurugesh

OK, then we are more on the same page now :+1:

Could you clarify more - same results as same number values (which is practically impossible) or same results as same diagram “on paper” or something else?

Yes, that is what prepares the inputs for each “head”, each partitioning is the numbers that each head will have to “crunch” :slight_smile:

It depends on what you call embedding here. The tl.Dense(d_feature) altered the contents of the initial embeddings (and created the Q matrix), and the subsequent ComputeAttentionHeads does not alter the values of Q but reshapes them for DotProductAttn_in3.

That’s exactly right.

I’m not sure I understand your point fully, but that is somewhat similar to what happens in the green box (the dot product between the “blue”, “red”, “green”, “purple” slices of Q and K in the picture in @Elemento’s answer):

Forgive me for not taking a closer look at the TensorFlow tutorial; I do not have much time right now, so I will not comment on that :slight_smile:


Hey @Elemento

Nice illustration (and the DLS thread), thank you for sharing :+1: it very nicely illustrates the point I was trying to make, especially the bigger picture

Interesting :slight_smile: I wish I had more time to dive deeper into this :slight_smile: Do you have any insights comparing PyTorch and Tensorflow approaches?

Hi @elemento and @arvyzukai, thanks for your replies. However, I still conclude that what is mentioned in NLP C4 W2 is not correct, as against what is given in the Deep Learning Specialization.

The two approaches are not the same, unlike what @elemento says. This can be seen by looking at the code for multi-head attention - tf.keras.layers.MultiHeadAttention  |  TensorFlow v2.12.0

The assignment in the Deep Learning Specialization (which is almost the same as the TensorFlow tutorial code) uses the above TensorFlow MHA layer. If we look at the source code - keras/ at v2.12.0 · keras-team/keras · GitHub - inside the MHA layer, the three matrices of size d_feature are first fed into a dense layer. The scaled dot product attention is then taken on the results of the dense layer.

In NLP C4 W2, the MHA layer is not used; instead, a user-defined function calculates the multi-head causal attention. In that function, the matrices (after vertical partitioning) are not fed into any dense layer. After we split the matrices into three (vertical partitioning), none of them undergoes any dense layer computation. The function directly calculates the dot product attention on the partitioned matrices - this is what I am asserting is wrong.

If so, if the code is wrong, how can we get good results? :slight_smile:

Note that there are several serial iterations of MHA and feedforward. The result of the first MHA is fed into the MHA of the subsequent layers (frankly, I did not realize this aspect while studying Andrew Ng’s lectures - I was assuming they were parallel :slight_smile: ). In the NLP C4 W2 implementation, in the first iteration, there won’t be any dense layer. But since the results of the MHA go through the MHA dense layers of the subsequent iterations, we may be ending up with good results. Still, this won’t be the correct implementation.

Hi @nmurugesh

I must admit I’m not following your logic. Can you pinpoint the exact places where you think the implementation is different?

The way I understand this:

Is exactly the same as this:

Dense_512 is exactly the same as self._query_dense()
DotProductAttn_in3 is the same as self._compute_attention

What is missing?

Hi @arvyzukai, in the code Branch_out3 [Dense_512, AttnHeads], the reshaping into multiple heads happens in the AttnHeads function, which receives its input from Dense_512. But in the TensorFlow MHA layer, is there such a reshaping after the dense layer and before calculating the dot product attention? I don’t think so.

The d_heads are inputs to the MHA layer in TensorFlow, and hence to the dense layer of the MHA in the code excerpt, but in our case the d_heads are created after the dense layer and before the dot product attention. This is the difference.

It has a lot of implications. When we vertically partition the matrices, as we do in the compute_attention_heads function of the NLP assignment, the query, key and values belong to the same word, i.e. the relationships and importance of various segments of the embedding of the same word are learnt through the multi-head attention. But if we use the TensorFlow MHA layer code, the relationships and importance are learnt between words through the MHA mechanism.

Had some time to look at the TensorFlow implementation - you are right, the reshape is done prior to self._query_dense(query), but it should not make any difference because the Einsum takes care of it (check the Course 3 Week 2 Lab: Lecture Notebook: Hidden State Activation calculations)
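The equivalence between an einsum-style projection and "dense, then reshape" can be checked with a small NumPy sketch. The equation string, the kernel shape and all sizes here are assumptions chosen for illustration, not a copy of the Keras internals.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, seq_len, d_model, n_heads, d_head = 2, 5, 8, 2, 4   # assumed toy sizes

x = rng.normal(size=(batch, seq_len, d_model))
kernel = rng.normal(size=(d_model, n_heads, d_head))       # per-head projection weights

# Einsum projection: the head axis appears directly in the output.
q_einsum = np.einsum("abc,cde->abde", x, kernel)           # (batch, seq_len, n_heads, d_head)

# Dense projection with the same weights flattened, followed by a reshape.
q_dense = x @ kernel.reshape(d_model, n_heads * d_head)    # (batch, seq_len, n_heads * d_head)
q_reshaped = q_dense.reshape(batch, seq_len, n_heads, d_head)

print(q_einsum.shape)                                      # (2, 5, 2, 4)
print(np.allclose(q_einsum, q_reshaped))                   # True
```

In this sketch the two routes hold the same numbers; whether the head axis is created inside the projection or by a reshape afterwards is a layout choice.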

Again, I think you are here mixing up “Initial embeddings” with “Q”. Do you realize that the slicing is happening on Q matrix and not the “Embedding matrix + Positional Encoding”? Main point - each head gets its own d_head column (or row) allocation in the Q matrix (it doesn’t matter Tensorflow or NLP Assignment - it doesn’t matter if it is split up or not).
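To see what each head compares after the Q matrix is sliced, the attention weights can be computed per head. This is a minimal NumPy sketch with assumed toy sizes and random stand-ins for the Dense outputs; masking is left out.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 6, 8, 2          # assumed toy sizes
d_head = d_model // n_heads

q = rng.normal(size=(seq_len, d_model))      # stand-in for Dense(embeddings) -> Q
k = rng.normal(size=(seq_len, d_model))      # stand-in for Dense(embeddings) -> K

# Slice the feature axis into heads: (n_heads, seq_len, d_head).
q_heads = q.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)
k_heads = k.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)

# Scaled scores and a row-wise softmax, computed independently per head.
scores = q_heads @ k_heads.transpose(0, 2, 1) / np.sqrt(d_head)
scores -= scores.max(axis=-1, keepdims=True)
weights = np.exp(scores)
weights /= weights.sum(axis=-1, keepdims=True)

print(weights.shape)   # (2, 6, 6): one token-by-token matrix per head
```

Each head ends up with a (seq_len, seq_len) weight matrix, so the comparison is between positions in the sequence, using that head's slice of Q and K.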

“Vertically” or “horizontally” depends on your tensor - if you switch the last two dimensions, vertical suddenly becomes horizontal and vice versa. In other words, a “[N, H]” vertical split is the same as an “[H, N]” horizontal split. Or are you suggesting that the split is done differently?

P.S. I’m out for the weekend :slight_smile:
Best regards.

Hi @arvyzukai, I am also not familiar with einsum-based dense layer calculations. But the TensorFlow documentation gives an example for MHA. The documentation (tf.keras.layers.MultiHeadAttention  |  TensorFlow v2.12.0) says: “…This layer first projects query, key and value. These are (effectively) a list of tensors of length num_attention_heads, where the corresponding shapes are (batch_size, <query dimensions>, key_dim), (batch_size, <key/value dimensions>, key_dim), (batch_size, <key/value dimensions>, value_dim).”

This means the output of the dense layer itself gives multiple heads. But in the NLP assignment code, the output of the dense layer is converted into multiple heads, which is wrong. To put it another way, the number of heads is an input to the dense layer, as can be seen in the code excerpt given by you :slight_smile: But in our code, the input to the dense layer is d_model, and the number of heads is used only in the subsequent function.

Do revert back if you still think my understanding is wrong.

I summarize here the difference:

Let us say the input has shape [2, 1024], i.e. two sentences of 1024 words each. Let us say we need 8 heads and the embedding layer has depth 512.

  1. The Dense layer should learn the Q,K,V matrices for dimension
    [B,T,N,H] → [2,1024, 8, 512]

In the NLP case, the QKV matrices are learnt only for the size [2,1024,512].

The difference is that each word is represented by multiple heads - say, 8 x 512 in an MHA layer with 8 heads of embedding dimension 512 each.

The last dimension H is key_dim in the case of MHA; it is the same as the embedding dimension. But in the case of the NLP assignment, it is embedding_dimension / n_heads. This means the dense layer does not learn weights for each head.

  2. Learning the Q matrix from the (embedding + position encoding) matrix for a size of [2,1024,512], then reshaping it to the size [2,1024,8,64], and then doing dot product attention cannot be considered the same.

Doing dot product on reshaped matrices as done in NLP assignment implies attention is not calculated between different heads of each word but within each word. In terms of theoretical understanding, each head represents different queries and answers and we need to find out the relationship, position and importance of one head over other. But when we do attention without learning different heads, we are calculating attention within each head.

Hi @arvyzukai, I am sorry for the inconvenience. On further study, I find that the only difference seems to be the dimension of each head. In the TensorFlow MHA layer, the dimension of each head is the same as d_model, but in the NLP assignment implementation it is d_model / n_heads. As I mentioned in the previous reply, it is true that the weights are learnt for each head of dimension equal to d_model. In the NLP assignment, the weights are learnt for d_model, since the total dimension of all the heads together is also d_model. Hence, the model is producing equivalent results.

Hence, the only difference between MHA layer implementation and NLP assignment is the dimension of heads.
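The effect of the head dimension on the number of learned projection weights can be sketched with plain arithmetic. `qkv_weight_count` is a hypothetical helper introduced for this illustration; it counts only the Q, K and V projection weights and ignores biases and the output projection.

```python
def qkv_weight_count(d_model, n_heads, d_head):
    """Weights in the Q, K and V projections (biases ignored)."""
    return 3 * d_model * n_heads * d_head

d_model, n_heads = 512, 8

small_heads = qkv_weight_count(d_model, n_heads, d_model // n_heads)  # d_head = 64
large_heads = qkv_weight_count(d_model, n_heads, d_model)             # d_head = 512

print(small_heads)                  # 786432, i.e. 3 * 512 * 512
print(large_heads // small_heads)   # 8
```

With d_head = d_model / n_heads, the heads together use as many weights as three full (d_model, d_model) matrices; with d_head = d_model, each head gets a full-width projection and the count grows by a factor of n_heads.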


Hey @arvyzukai,
I haven’t tried comparing these myself, and at present, I am not very well-versed with this concept. I might need some time to dive into this, and I need to submit my thesis this week. Someday later, I will experiment with the 2 layers in a notebook, and will share my insights.

Hey @nmurugesh, apologies if I infused any confusion in this thread. I just shared what I thought could be the reason for this confusion :smile:



Hi @Elemento, thanks for your replies as well. I just checked some sample summarization tasks with our transformer summarizer vs ChatGPT, as I was going through the ChatGPT prompt engineering short course! I found the results are almost the same!! If the implementation were incorrect, there should be at least a minor quality difference, I thought. So, I rechecked everything - though my contention that the weights of each head should be learned is correct, this is also happening in the NLP assignment - it is just that the dimension of the heads is smaller. Anyway, thanks a lot to both of you for sharing your time.
