Transformers (Multi-head Attention) question

Hello,

I have a question here.


I started implementing a Transformer from scratch following the paper [1706.03762] Attention Is All You Need (arxiv.org). There, we divide d_model into sub-dimensions (d_k), so I thought this implementation would not be very different from single-head attention, because each head's parameters never see the full d_model that represents the embedding. So I decided to try the implementation in the image below, and it works correctly. Has anyone done this and seen a large difference?
(I ran my model for a small number of steps and didn't see a big difference in loss or time. I think that if I continued running the full epochs I would see some improvement in accuracy, but I didn't do a full run because of the resources I have :sweat_smile:)

Best Regards,
Abdelrahman

In a single-head attention model, the size of the hidden embedding is the same as the model's embedding, that is, (model_dim, embed_dim). In this case, Q, K, and V maintain internal matrices of this size.

In a multi-head attention model, the size of the hidden matrices inside each head is a fraction of the model's embedding. In particular, we take the model_dim and divide it by the number of heads:

model_dim / heads

So the internal matrices of each head will be of size (model_dim/h, embed_dim).

The input to each head is the whole model embedding, but this is transformed to the reduced size, processed internally, and becomes the output of each head. Then all heads' outputs are concatenated and we get back the original size.

Inside each head we received the full model embedding, transformed it linearly to its new size, applied the attention formula, and created the output. So each head does work with the entire model information but in a reduced dimensional space.
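To make the dimension bookkeeping concrete, here is a minimal sketch of that reshaping (the sizes are just illustrative, and w_q is only a stand-in for the projection inside the block):

import torch
import torch.nn as nn

batch, seq_len, d_model, h = 2, 10, 512, 8
d_k = d_model // h  # 64: the reduced dimension each head works in

x = torch.randn(batch, seq_len, d_model)       # full model embedding
w_q = nn.Linear(d_model, d_model, bias=False)  # hypothetical query projection

# Every head receives the full embedding; the projection + reshape is what
# places each head in its own d_k-sized subspace.
q = w_q(x)                                          # (batch, seq_len, d_model)
q = q.view(batch, seq_len, h, d_k).transpose(1, 2)  # (batch, h, seq_len, d_k)
print(q.shape)  # torch.Size([2, 8, 10, 64])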

Thoughts?


Hi Mr. @Juan_Olano,

This is the original implementation

import math
import torch
import torch.nn as nn


class MultiHeadAttentionBlock(nn.Module):

    def __init__(self, d_model: int, h: int, dropout: float) -> None:
        super().__init__()
        self.d_model = d_model # Embedding vector size
        self.h = h # Number of heads
        # Make sure d_model is divisible by h
        assert d_model % h == 0, "d_model is not divisible by h"

        self.d_k = d_model // h # Dimension of vector seen by each head
        self.w_q = nn.Linear(d_model, d_model, bias=False) # Wq
        self.w_k = nn.Linear(d_model, d_model, bias=False) # Wk
        self.w_v = nn.Linear(d_model, d_model, bias=False) # Wv
        self.w_o = nn.Linear(d_model, d_model, bias=False) # Wo
        self.dropout = nn.Dropout(dropout)

    @staticmethod
    def attention(query, key, value, mask, dropout: nn.Dropout):
        d_k = query.shape[-1]
        # Just apply the formula from the paper
        # (batch, h, seq_len, d_k) --> (batch, h, seq_len, seq_len)
        attention_scores = (query @ key.transpose(-2, -1)) / math.sqrt(d_k)
        if mask is not None:
            # Write a very low value (indicating -inf) to the positions where mask == 0
            attention_scores.masked_fill_(mask == 0, -1e9)
        attention_scores = attention_scores.softmax(dim=-1) # (batch, h, seq_len, seq_len) # Apply softmax
        if dropout is not None:
            attention_scores = dropout(attention_scores)
        # (batch, h, seq_len, seq_len) --> (batch, h, seq_len, d_k)
        # return attention scores which can be used for visualization
        return (attention_scores @ value), attention_scores

    def forward(self, q, k, v, mask):
        query = self.w_q(q) # (batch, seq_len, d_model) --> (batch, seq_len, d_model)
        key = self.w_k(k) # (batch, seq_len, d_model) --> (batch, seq_len, d_model)
        value = self.w_v(v) # (batch, seq_len, d_model) --> (batch, seq_len, d_model)

        # (batch, seq_len, d_model) --> (batch, seq_len, h, d_k) --> (batch, h, seq_len, d_k)
        query = query.view(query.shape[0], query.shape[1], self.h, self.d_k).transpose(1, 2)
        key = key.view(key.shape[0], key.shape[1], self.h, self.d_k).transpose(1, 2)
        value = value.view(value.shape[0], value.shape[1], self.h, self.d_k).transpose(1, 2)

        # Calculate attention
        x, self.attention_scores = MultiHeadAttentionBlock.attention(query, key, value, mask, self.dropout)
        
        # Combine all the heads together
        # (batch, h, seq_len, d_k) --> (batch, seq_len, h, d_k) --> (batch, seq_len, d_model)
        x = x.transpose(1, 2).contiguous().view(x.shape[0], -1, self.h * self.d_k)

        # Multiply by Wo
        # (batch, seq_len, d_model) --> (batch, seq_len, d_model)  
        return self.w_o(x)
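As a quick sanity check, the block above can be exercised like this (the sizes are only illustrative):

block = MultiHeadAttentionBlock(d_model=512, h=8, dropout=0.1)
x = torch.randn(2, 10, 512)      # (batch, seq_len, d_model)
out = block(x, x, x, mask=None)  # self-attention, no masking
print(out.shape)                 # torch.Size([2, 10, 512])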

OK, I understand that each head receives the whole model embedding but that it is then transformed to the reduced size. But what do you think about not reducing the size? If we reduce the size, we of course compress the values and meaning into smaller vectors or matrices, and this may lead to the loss of some information that could be important. I also know that gradient descent will deal with this problem in one way or another, but it will not solve 100% of it. What do you think about applying something like stacking several single-head attentions side by side, so that each single-head attention has a different perspective, as in this image?


We will not reduce the size. We will compute scores for each different head (each a single-head attention), and in the end the model will have a large matrix from which it can choose, from each head, the parts that carry the most meaningful values,
like this implementation:


import math
import torch
import torch.nn as nn


class MultiHeadAttentionBlock(nn.Module):
    def __init__(self, d_model: int, heads: int, dropout: float) -> None:
        super().__init__()
        self.d_model = d_model  # Embedding vector size
        self.heads = heads  # Number of heads
        self.w_q = nn.Linear(self.d_model, self.heads * self.d_model)  # Wq: project to heads * d_model
        self.w_k = nn.Linear(self.d_model, self.heads * self.d_model)  # Wk
        self.w_v = nn.Linear(self.d_model, self.heads * self.d_model)  # Wv
        self.w_o = nn.Linear(self.heads * self.d_model, self.d_model)  # Wo: project back to d_model
        self.dropout = nn.Dropout(dropout)
    
    
    @staticmethod
    def attention(query, key, value, mask, dropout: nn.Dropout):
        d_model = query.shape[-1]
        # Just apply the formula from the paper
        # (batch, h, seq_len, d_model) --> (batch, h, seq_len, seq_len)
        attention_scores = (query @ key.transpose(-2, -1)) / math.sqrt(d_model)
        if mask is not None:
            # Write a very low value (indicating -inf) to the positions where mask == 0
            attention_scores.masked_fill_(mask == 0, -1e9)
        attention_scores = attention_scores.softmax(dim=-1) # (batch, h, seq_len, seq_len) # Apply softmax
        if dropout is not None:
            attention_scores = dropout(attention_scores)
        # (batch, h, seq_len, seq_len) --> (batch, h, seq_len, d_model)
        # return attention scores which can be used for visualization
        return (attention_scores @ value), attention_scores
    
    
    def forward(self, q, k, v, mask):
        query = self.w_q(q) # (batch, seq_len, d_model) --> (batch, seq_len, d_model)
        key = self.w_k(k) # (batch, seq_len, d_model) --> (batch, seq_len, d_model)
        value = self.w_v(v) # (batch, seq_len, d_model) --> (batch, seq_len, d_model)
        
        
        # (batch, seq_len, d_model) --> (batch, seq_len, h, d_model) --> (batch, h, seq_len, d_model)
        query = query.view(query.shape[0], query.shape[1], self.heads, self.d_model).transpose(1, 2)
        key = key.view(key.shape[0], key.shape[1], self.heads, self.d_model).transpose(1, 2)
        value = value.view(value.shape[0], value.shape[1], self.heads, self.d_model).transpose(1, 2)

        # Calculate attention
        x, self.attention_scores = MultiHeadAttentionBlock.attention(query, key, value, mask, self.dropout)
        
        # Combine all the heads together
        # (batch, h, seq_len, d_model) --> (batch, seq_len, h, d_model) --> (batch, seq_len, d_model)
        x = x.transpose(1, 2).contiguous().view(x.shape[0], -1, self.heads * self.d_model)

        # Multiply by Wo
        # (batch, seq_len, d_model) --> (batch, seq_len, d_model)  
        return self.w_o(x)
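For comparison, a quick shape check of this variant (again with illustrative sizes). The external interface is unchanged; the difference is internal, since each head now attends in the full d_model-dimensional space instead of d_model / heads:

variant = MultiHeadAttentionBlock(d_model=512, heads=8, dropout=0.1)
x = torch.randn(2, 10, 512)        # (batch, seq_len, d_model)
out = variant(x, x, x, mask=None)
print(out.shape)                   # torch.Size([2, 10, 512])
# Internally, query/key/value are (2, 8, 10, 512) here instead of (2, 8, 10, 64).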

@AbdElRhaman_Fakhry ,

If I am understanding you correctly, you are proposing that each head computes attention in the entire input of the model, correct?

If this is the scenario you are proposing, then I see a couple of side effects:

  1. The computational power needed to handle it would grow proportionally with the size.
  2. Memory requirements would also grow in the same proportion (a rough parameter count is sketched below).
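For instance, counting only the weights of the four projection layers (and ignoring biases), with an assumed d_model = 512 and 8 heads, the proposed variant has roughly 8x the attention parameters of the original block:

d_model, h = 512, 8

# Original block: w_q, w_k, w_v, w_o are all (d_model, d_model)
original_params = 4 * d_model * d_model                                 # 1,048,576

# Proposed variant: w_q, w_k, w_v are (d_model, h * d_model), w_o is (h * d_model, d_model)
variant_params = 3 * d_model * (h * d_model) + (h * d_model) * d_model  # 8,388,608

print(variant_params / original_params)  # 8.0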

Then, after the heads output their attention matrices, what mechanism would you propose to handle these n head matrices? You mention “selecting the best of each”. Can you please expand again on this concept? Would it be through another round of attention? Would it be to reduce the size of these outputs so that a concatenation of all attention matrices would lead back to the size of the model's input?

Now, one thing is probably true: maybe the model will learn more complex relationships. I think that's very possible. The question is: is the trade-off economically acceptable?

These are just some thoughts. Let's continue discussing this. It is interesting!

What do you think of my comments?

@Juan_Olano Thanks for your comments

Yes, that's correct: each head will compute attention over the entire input of the model, and that will lead to more powerful representations.

If I understand you correctly,
By multiplying with the output matrix “wo”, and through gradient descent, the model will represent more complex relationships than the previous (original) model, because it computes attention over the entire input of the model, and it will return the same size as the previous (original) model.

(If by another round of attention you mean another layer of attention:) No, we don't do another round of attention.

The reduction here happens in the last step, just like in the previous (original) model, but with a larger matrix multiplication. In all the other steps we don't reduce anything: we first compute attention scores for each head (each head has information about the entire input of the model), and multiplying the attention scores with the values leads to a larger, better selection of the important information from the value matrix, with more complex representations and relationships. (Now we have a result matrix which holds the best values of the value matrix.) Finally, we take the result and multiply it with the matrix “WO”, which reduces the size back to the size of the model's input.

Yes, I think so.

Well, it seems you have a very clear experiment at hand. I suggest that you go for it, implement it, and share the results.

I still think that the extra computing power and memory requirements can make it so expensive in a big model that maybe the extra accuracy is not worth it.

But maybe for small models this makes sense.

Try to define the rules of evaluation, so that your experiment is properly measured against the status quo.

I truly look forward to this experiment!


@Juan_Olano

This is the model link on Colab: Google Colab. I also used this model to create a translation model and it ran successfully, but I didn't run all the epochs because of resources :sweat_smile:. I will run that, and I will try to define the rules of evaluation between the two of them. Thanks for the help and encouragement. I will keep you posted on everything I come up with.

Congratulations @AbdElRhaman_Fakhry ! This is a great step! You had the will and the aim and made it happen.

I have a question:

Can you explain to me how you are handling the matrices that come out of each head? I was looking at the model and I missed it. Sorry about it.

I see that you are instantiating 8 heads. This means that the output of the attention module will be 8 * model_size (8 matrices of model_size). However, in the feedforward I don't get to see how you are handling these 8 matrices.

Another question: you said you created a translation model - can you share it along with the dataset? I would like to run it for several epochs and see how it behaves. If you cannot share, no worries, I understand.

@Juan_Olano

The input of the MultiHeadAttentionBlock class is the embedding plus the position encoding, which is a matrix of dimension (sequence length, d_model), like this image:

After that, I project the input into 3 things (query, key, value). For example, to project the input into the query, I multiply the input, which is (sequence length, d_model), by a matrix called WQ (which has dimension (d_model, heads * d_model)) using this command:

 self.w_q=nn.Linear(self.d_model,self.heads*self.d_model)  # Linear transformation for queries

like this image


This will produce the query matrix, which will be (sequence length, heads * d_model):

query = self.w_q(q)  # Apply linear transformation to queries

I do the same thing for the key and value matrices. After that, I compute the attention scores and the attention output, like this image:
The query dimension is (sequence length, heads * d_model), times the transpose of the key, (heads * d_model, sequence length). This gives an output of dimension (sequence length, sequence length). After applying the softmax and the mask, we still have an output matrix of dimension (sequence length, sequence length). We then multiply it with the value matrix, which is (sequence length, heads * d_model), and this leads to a result matrix with dimensions (sequence length, heads * d_model). These are the attentions, but here I didn't do any concatenation, because I create one big matrix that replaces the concatenation phase.

# Using this function:
def attention(query, key, value, mask, dropout: nn.Dropout):
        d_model = query.shape[-1]  # Get the dimension of the model
        # Calculate attention scores using the scaled dot-product attention formula
        attention_scores = (query @ key.transpose(-2, -1)) / math.sqrt(d_model)
        if mask is not None:
            # Mask positions where the mask is zero by setting scores to a very low value
            attention_scores.masked_fill_(mask == 0, -1e9)
        # Apply softmax to obtain the attention distribution
        attention_scores = attention_scores.softmax(dim=-1)
        if dropout is not None:
            attention_scores = dropout(attention_scores)  # Apply dropout for regularization
        # Calculate the weighted sum of values based on attention scores
        return (attention_scores @ value), attention_scores

After that, I apply a linear (forward) layer to return a matrix with the same dimension as the input matrix, which is (sequence length, d_model), like this image:

So I multiply the final output matrix, which has dimension (sequence length, heads * d_model), with another matrix called WO, which has dimension (heads * d_model, d_model):

 self.w_o = nn.Linear(self.heads*self.d_model, self.d_model)  # Output linear transformation

like this image


This leads to a final output matrix of (sequence length, d_model):

 return self.w_o(x)
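Putting the whole walkthrough together as a standalone shape trace (with made-up sizes seq_len = 6, d_model = 16, heads = 4, no batch dimension, and plain tensors standing in for the nn.Linear weights; note that the class itself also reshapes into per-head tensors before the softmax, which this flat trace skips):

import math
import torch

seq_len, d_model, heads = 6, 16, 4
x = torch.randn(seq_len, d_model)            # embedding + positional encoding

w_q = torch.randn(d_model, heads * d_model)  # stand-in for self.w_q.weight.T
w_k = torch.randn(d_model, heads * d_model)
w_v = torch.randn(d_model, heads * d_model)
w_o = torch.randn(heads * d_model, d_model)  # stand-in for self.w_o.weight.T

q, k, v = x @ w_q, x @ w_k, x @ w_v          # each (seq_len, heads * d_model)
scores = (q @ k.T) / math.sqrt(q.shape[-1])  # (seq_len, seq_len)
attn = scores.softmax(dim=-1) @ v            # (seq_len, heads * d_model), no concatenation needed
out = attn @ w_o                             # (seq_len, d_model), back to the model size
print(q.shape, scores.shape, attn.shape, out.shape)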

Of course I will share the translation model with you, but I will publish it soon, as there are some things I should finish first. Also, if something isn't clear in the previous answer, please ask me about it.

Interesting. How is the WO matrix initialized and trained?

@Juan_Olano
This command initializes the matrices:

    # Initialize the parameters of the Transformer model
    for p in transformer.parameters():
        if p.dim() > 1:
            nn.init.xavier_uniform_(p)  # Initialize using Xavier initialization

    return transformer  # Return the constructed Transformer model

It's at the end of the Colab notebook, but the training/optimization isn't there; it's in another file that I will share with the whole model.

So let's assume we have a transformer with a context of 1024 tokens, an embedding matrix with embedding size = 728, and 8 heads in the attention mechanism.

This means that the model’s input is (1024, 728). We pass this matrix to each one of the 8 heads.

The output of each head will be a value matrix of (1024, 728), correct? And I have 8 of these matrices. You are concatenating them so you end up with a matrix of (1024 * 8, 728). Then you reduce it back to the model's size by multiplying a WO matrix of dimension (1024, 1024 * 8) with the concatenated matrix of (1024 * 8, 728). This is what it looks like:

transformed_value_matrix = WO x concatenated_value_matrices

transformed_value_matrix = (1024, 1024 * 8) x (1024 * 8, 728)

transformed_value_matrix = (1024, 728) ==> This is again the size of the model's input.

Please let me know if my understanding is right or wrong.

Thank you!

@Juan_Olano
First, the multiplication in a linear layer is X * W, not W * X.

This is correct. After that, we project this input into 3 things (query, key, value) with a linear layer. For example, to project the input into the query we use the matrix w_q, giving query * heads = query * 8, so the w_q matrix shape is (728, 728 * 8), and the query matrix will be (1024, 728) * (728, 728 * 8) = (1024, 728 * 8).

After that we will have query, key, and value matrices, each of dimension (1024, 728 * 8).

After that we compute the heads using this equation:

Each head will have dimension (1024, 728); since we have 8 heads, the dimension of all the heads together will be (1024, 728 * 8).

Yes, but the dimension of the WO matrix will be (728 * 8, 728). We multiply the heads with WO: (1024, 728 * 8) * (728 * 8, 728) = (1024, 728) ==> this is again the size of the model's input.
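As a quick check of those shapes (only the dimensions matter here):

import torch

seq_len, d_model, heads = 1024, 728, 8
heads_out = torch.randn(seq_len, heads * d_model)  # concatenated head outputs: (1024, 5824)
w_o = torch.randn(heads * d_model, d_model)        # WO: (5824, 728)
print((heads_out @ w_o).shape)                     # torch.Size([1024, 728]) -- back to the model size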

You are right, it is (1024, 728 * 8) and the WO is (728 * 8, 728), and the product is X * WO. Thank you for the correction!

I was thinking about a different version to this model:

What if instead of using a WO matrix to reduce the dimension of X back to the model’s size, we used convolutions followed by a concatenation?

The convolution of each head_x would bring it to (model_size, embed_size // num_heads), and then the n_head matrices can be concatenated to arrive at (model_size, embed_size).

By using convolutions, we are concentrating the main features of the output of each head and then keeping those concentrated features in the final attention output after concatenation.
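If it helps, here is one purely hypothetical sketch of what I mean (the ConvHeadReducer name, the kernel size, and the use of Conv1d over the embedding channels are my own assumptions, not anything from your notebook):

import torch
import torch.nn as nn

class ConvHeadReducer(nn.Module):
    # One Conv1d per head maps that head's (seq_len, d_model) output down to
    # (seq_len, d_model // heads); the reduced outputs are then concatenated
    # back to (seq_len, d_model).
    def __init__(self, d_model: int, heads: int, kernel_size: int = 3) -> None:
        super().__init__()
        assert d_model % heads == 0
        self.convs = nn.ModuleList([
            nn.Conv1d(d_model, d_model // heads, kernel_size, padding=kernel_size // 2)
            for _ in range(heads)
        ])

    def forward(self, head_outputs):  # list of (batch, seq_len, d_model) tensors, one per head
        reduced = [
            conv(h.transpose(1, 2)).transpose(1, 2)  # Conv1d expects (batch, channels, seq_len)
            for conv, h in zip(self.convs, head_outputs)
        ]
        return torch.cat(reduced, dim=-1)  # (batch, seq_len, d_model)

heads, d_model = 8, 512
reducer = ConvHeadReducer(d_model, heads)
head_outputs = [torch.randn(2, 10, d_model) for _ in range(heads)]
print(reducer(head_outputs).shape)  # torch.Size([2, 10, 512])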

What do you think about this variant?

@Juan_Olano
I think it was a very good suggestion. It's also the first time I have heard about using convolutions in multi-head attention… I will look into it and tell you everything I find, but after I finish the model… I am a little sick these days :sweat_smile: and also on vacation, so when I return, I will inform you of everything new.

Hey @AbdElRhaman_Fakhry, get well first! I am sorry that you are sick.

I am creating the transformer with the convolutions - I will share my results soon. I think that by doing it I can capture and maintain the features found in each head. Good project!
