Conceptual Questions about Transformers

  1. What exactly is the output of the encoder? Is it just the K and V weights?

  2. Do we need to pass a translated sentence into the transformer, as an input? I thought the output of the transformer was the translated sentence. However, the lecture diagram and assignment require the translated sentence to be inputted into the transformer (or I may be misreading it).

  3. Why did we need to save the attention weights in the assignment? We don’t end up using them.

  4. Plus, why do we only save weights for the decoder section? Why not the encoder too? The encoder also uses attention.

  1. Please see the next point.
  2. The encoder output we want to pass to the decoder is the set of modified embeddings of the encoder terms. Look at class Encoder for more details. Here’s a recap:
    a. Perform dot-product self-attention to measure the similarity between encoder terms.
    b. Use this information to update the embeddings before passing them to the decoder.
    As for the translated sentence: we need to pass it as input during training. The expected output is the same translated sentence shifted by one step relative to the decoder input. This is done so that we can calculate the loss for all terms in parallel. Please go back to the lecture(s) to understand look-ahead masking and the role it plays in decoder attention.
  3. The weights are used in grading. See scaled_dot_product_attention_test for details.
  4. Decoder weights are also used for grading. See Decoder_test and Transformer_test
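
The shifted input/output pair and the look-ahead mask mentioned in point 2 can be sketched in a few lines of plain Python. This is only an illustration, not the assignment's code: the helper names `teacher_forcing_pair` and `look_ahead_mask`, the token ids, and the convention that 1 means "may attend" are all assumptions made for this example.

```python
def teacher_forcing_pair(target_ids):
    """Split a full target sequence into decoder input and expected output."""
    decoder_input = target_ids[:-1]    # [SOS, w1, ..., wn]
    expected_output = target_ids[1:]   # [w1, ..., wn, EOS]
    return decoder_input, expected_output


def look_ahead_mask(size):
    """Lower-triangular mask: position i may attend to positions 0..i only."""
    return [[1 if j <= i else 0 for j in range(size)] for i in range(size)]


# Hypothetical token ids: 1 = start-of-sentence, 2 = end-of-sentence.
target = [1, 7, 8, 9, 2]
dec_in, expected = teacher_forcing_pair(target)
mask = look_ahead_mask(len(dec_in))

print(dec_in)    # [1, 7, 8, 9]
print(expected)  # [7, 8, 9, 2]
for row in mask:
    print(row)
```

Because row i of the mask hides every position after i, the prediction at each position depends only on earlier target tokens, which is what allows the loss for all positions to be computed in one parallel pass.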

Thanks for your response. This has cleared up a lot. However, I still have some questions related to your answer.

  1. What do you mean by “encoder terms”? Is this the sentence to be translated (and tokenized)?

  2. In the first paragraph of section 4 of the assignment, it says that the encoder actually outputs the matrices of K and V to pass into the decoder. However, the implementation only shows the encoder outputting the encoded sentence to the decoder. So, I’m confused by how exactly the K and V matrices are passed to the decoder.

  1. Consider using a transformer for French-to-English translation. The encoder receives the tokenized French sentence provided by the user as input.
  2. The encoder outputs only one tensor of interest. This output is used for both K and V. Please look at the decoder figure and observe that the same line feeds both K and V:

So, in your original response, when you said “modified embeddings”, were you referring to the K and V matrices?

I was referring to the output of the encoder which is also the same as K and V.
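
A minimal NumPy sketch of this point, under the assumption of a single attention head: the same encoder output is fed in as both the K and V inputs of the decoder's second attention, while Q comes from the decoder side. The function name `cross_attention`, the shapes, and the random weights are hypothetical and only for illustration. In a standard attention layer the learned projections (here `w_q`, `w_k`, `w_v`) belong to the decoder's attention layer, so the encoder hands over one tensor rather than separate K and V matrices.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(decoder_x, encoder_out, w_q, w_k, w_v):
    """Single-head scaled dot-product attention over the encoder output."""
    q = decoder_x @ w_q      # queries come from the decoder
    k = encoder_out @ w_k    # keys come from the encoder output
    v = encoder_out @ w_v    # values come from the same encoder output
    scores = q @ k.T / np.sqrt(k.shape[-1])
    weights = softmax(scores, axis=-1)
    return weights @ v, weights


rng = np.random.default_rng(0)
d_model, d_head = 8, 4
encoder_out = rng.normal(size=(5, d_model))  # 5 source tokens
decoder_x = rng.normal(size=(3, d_model))    # 3 target tokens
w_q, w_k, w_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))

context, attn = cross_attention(decoder_x, encoder_out, w_q, w_k, w_v)
print(context.shape)  # (3, 4)
print(attn.shape)     # (3, 5): one row of weights per target token
```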


I’ve just finished the assignment, but I still have a question concerning W^Q, W^K, W^V.
I am wondering whether they are different for each self-attention layer.
My guess would be that they are different, but it’s not explicitly shown in the diagram.
Thanks in advance.

Have you tried using tf.equal and tf.reduce_all to answer your question?

Hi @py_0153 ,

When I have implemented transformers with multi-head attention, the self-attention class instantiates its own Q, K, V projections for each self-attention head. Something of this sort:


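import torch
import torch.nn as nn

n_embd = 32  # embedding size; assumed value, not given in the post

class SelfAttentionHead(nn.Module):
    def __init__(self, head_size):
        super().__init__()  # required before registering submodules
        self.key = nn.Linear(n_embd, head_size, bias=False)    # W^K
        self.query = nn.Linear(n_embd, head_size, bias=False)  # W^Q
        self.value = nn.Linear(n_embd, head_size, bias=False)  # W^V
    def forward(self, x, mask=True):
        k, q, v = self.key(x), self.query(x), self.value(x)
        scores = q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5  # scaled dot product
        if mask:  # look-ahead mask: hide future positions
            scores = scores.masked_fill(torch.ones_like(scores).triu(1).bool(), float("-inf"))
        return torch.softmax(scores, dim=-1) @ v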

As you can see, this class declares the Q, K, V projection layers in __init__, and they are applied to the input in forward.

After defining this class, I define the MultiHeadAttention class like so:


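class MultiHeadAttention(nn.Module):
    def __init__(self, num_heads, head_size):
        super().__init__()  # required before registering submodules
        # num_heads independent heads, each with its own Q, K, V weights
        self.heads = nn.ModuleList([SelfAttentionHead(head_size) for _ in range(num_heads)])

    def forward(self, x, mask=True):
        # Run every head on the same input and concatenate along the feature axis
        # (the output projection W^O is left out of this excerpt).
        out = torch.cat([head(x, mask) for head in self.heads], dim=-1)
        return out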

This class instantiates the ‘num_heads’ self-attention heads.

This of course, when you use a library, will be done by the library, but here I am showing you how I have implemented it from the ground up.

In conclusion: in my implementation, and I would say in library implementations as well, each self-attention head has its own instance of Q, K, V. Since each head has its own set of weight matrices, the resulting Q, K, and V matrices differ from head to head. The purpose of having multiple heads is to let the model capture different aspects of the input data by focusing on different parts of the input sequence.

Thoughts? Questions?



Thanks @balaji.ambresh Balaji for your answer,
I am not sure where I should find the W^Q, W^K, W^V.
I haven’t dug into the tf libraries yet.

Thanks @Juan_Olano Juan for your answer,
I realize that my question was not clear at all.
In fact, I am wondering whether W^Q, W^K, W^V, W^O are different in each layer of the multi-head attention. I am concerned with the layers rather than the heads. You made it clear, and Prof. Ng’s course also makes it clear, that the heads are different.
What about the W^Q, W^K, W^V, W^O in each layer?
Many thanks

Let’s call

  1. MHA = MultiHeadAttention
  2. Encoder layer = a single encoder layer
  3. Encoder block = stacked encoder layers
  4. The same terminology applies to the decoder.

There is no parameter sharing across layers. Here’s MHA

This is the encoder layer:

Within an encoder layer, the same tensor is passed to the MHA layer as Q, K and V. Each encoder layer outputs different values, so the inputs to MHA differ from one encoder layer to the next.

Here’s the decoder layer from an earlier reply:
For the 1st MHA, the same input is passed in as Q, K and V.
For the 2nd MHA, the encoder block output is passed as K and V. Q comes from the layer norm output that follows the 1st MHA within the same decoder layer.

Looking at a decoder block, the same K and V are used in the 2nd MHA across all decoder layers. These are generated by the encoder block. Since the output of the previous decoder layer is used as input to the next decoder layer, the input is different across decoder layers.
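
The statement that no parameters are shared across layers can be checked with a small comparison, in the spirit of the earlier tf.equal / tf.reduce_all hint. The sketch below uses NumPy stand-ins rather than the assignment's model: `make_layer_weights`, the layer count, and the matrix sizes are hypothetical, and each layer is simply given its own independently initialized W_Q, W_K, W_V, W_O.

```python
import numpy as np


def make_layer_weights(rng, d_model):
    """Create one layer's own attention weights (illustrative stand-ins)."""
    names = ("W_Q", "W_K", "W_V", "W_O")
    return {name: rng.normal(size=(d_model, d_model)) for name in names}


rng = np.random.default_rng(0)
num_layers, d_model = 3, 4
layers = [make_layer_weights(rng, d_model) for _ in range(num_layers)]

# Every layer has matrices of the same shape...
same_shape = all(
    layers[0][name].shape == layers[i][name].shape
    for i in range(1, num_layers)
    for name in layers[0]
)

# ...but no matrix in a later layer is equal to its counterpart in layer 0.
shared = any(
    np.array_equal(layers[0][name], layers[i][name])
    for i in range(1, num_layers)
    for name in layers[0]
)

print(same_shape)  # True
print(shared)      # False
```

On a TensorFlow model, the equivalent comparison of two weight tensors `a` and `b` would be `tf.reduce_all(tf.equal(a, b))`.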


Thank you, Balaji, for taking the time to answer my question.
I understand that. I am asking about the matrices W^Q, W^K, W^V.
Are they different in each block?
Thanks a lot for your answer.

Each multi-head-attention block is built of 1 or more self-attention heads. Each self-attention head has its own weights.

Thanks a lot Juan, I really appreciate it.
I do understand that each attention head in a block has its own weights, which I would write mathematically like this:
(I wrote it just for the query, but of course it holds for key and value)
Now, between blocks, can I write this:
Thanks in advance for your answer.
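
One possible way to write the notation under discussion, as a sketch only. The symbols are assumptions introduced here: $X^{(\ell)}$ is the input to layer $\ell$, $h$ is the number of heads, and $N$ is the number of layers. It reflects the statement earlier in the thread that there is no parameter sharing across layers.

```latex
% Query of head i in layer \ell (the same form holds for K and V):
Q_i^{(\ell)} = X^{(\ell)} \, W_i^{Q,(\ell)},
\qquad i = 1, \dots, h, \quad \ell = 1, \dots, N

% Within one layer, each head has its own matrix:
W_i^{Q,(\ell)} \text{ and } W_j^{Q,(\ell)} \text{ are separate parameters for } i \neq j

% Across layers, the matrices are separate as well:
W_i^{Q,(\ell)} \text{ and } W_i^{Q,(\ell')} \text{ are separate parameters for } \ell \neq \ell'
```

"Separate parameters" here means they are learned independently and nothing ties their values together.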