WEEK 4 Stuck in the final transformer asignment

Amir_Pasagic · May 8, 2024, 7:49am

I got a bit stuck in the following asignment in week 4, specifically on the part with defining the Encoder:

I was convinced that Q, V and K vectors/matrices share the same dimension. What does S represent in that case?

I see that input x is passed which I assumed is the sentence in form of the list of embeddings, so I assume dimensions (batch_size, max_sentence_len, source_embedding_dim).

However, MHA expects K,Q,V matrices and I do not understand how to convert then since I do not see matrices Wq, Wk, and Wv available anywhere. Am I overlooking something?

Also I am a bit puzzled by the sentence “Remember that to compute self -attention Q, V and K should be the same”.

I understand that MHA is just a repeated self-attention (with different W matrices), but wouldn’t in that case be just one tripple of (Wk,Wv and Wq) instead of several i.e. one for each “attention head”.

Also I am a bit confused by the dimensions:

query: Query Tensor of shape (B, T, dim).

value: Value Tensor of shape (B, S, dim).

key: Optional key Tensor of shape (B, S, dim). If not given, will use value for both key and value, which is the most common case.

TMosh · May 8, 2024, 7:57am

For self-attention, you use ‘x’ for all three parameters K, Q, and V.

Amir_Pasagic · May 8, 2024, 8:44am

Oh, TMosh, Hi again

So x isnt a list of word embeddings but rather a matrix that is both K, Q and V and is precomputed from the embeddings somewhere before?

The same is later added to the normalized output of the self attention?
I try to imagine it as self-attention provides a slight change in the value of X and that small “delta” is added to the vector X absolute value.

Amir_Pasagic · May 8, 2024, 8:51am

Also could you perhaps help me understand, why is K,Q and V same for self-attention. It goes against my intuitive understanding of self attention is.

TMosh · May 8, 2024, 5:21pm

I find that all of the intuitive explanations in the lectures about query, key, and value are entirely misleading.

It’s a very puzzling technique.

No, that’s not what happens.

Here’s another intuition that is probably also incorrect. Maybe one can look at self-attention as computing the similarity or correlation between all the words in the ‘x’ matrix.

Amir_Pasagic · May 12, 2024, 5:03pm

Personally, I dont find it hard to find an intuition behind the concepts of Q,K,V matrices. 3b1b had a fantastic video on attention, for these two seek intuition behind it.

How I see it - It is basically computing relevance one word has in the context of the other, based on the vector similarity between corresponding Q and K and then scaling vector V by this ammount and enriching the original vector with this additional semantic meaning. Multihead attention is repeating this number of times with different parameters, so different Q,K,V will be computed for same input sequence, thus allowing to find different context by which two or more words can relate, cause this is never singular.

All of this relies on the fact that these vectors reside in a sufficiently high dimensional space.

I fail to understand why does Keras implemetation documentation claim that key, query and value matrices all have to be set to same matrix in case of self-attention.

It goes against what I read and watched about attention as concept.
(If I understood well, MHA is just repeated self-attention with different Q,K,V pairs)

I try to imagine it as self-attention provides a slight change in the value of X and that small “delta” is added to the vector X absolute value

What I meant when I refered to this is the ADD & NORM part below where result of the attention is added to the original vector.

Way how I interpreted this is that a attention layer adds a change to vector X, adding a small delta value that is derived from the context of other words in the input sequence, rather then directly outputing modified vector itself. Perhaps this is computationally easier for some reason.

Topic		Replies	Views
Q,K,V all are same for self attention Sequence Models	5	639	November 19, 2023
Course 5 - Week 4 - A1 - Exercise 4 - EncoderLayer Sequence Models week-4	2	35	August 13, 2024
[Week 4] - Lab - Self Attention Sequence Models	1	625	June 4, 2021
Self-attention in the Transformer Network Sequence Models week-4	7	63	August 15, 2024
C5 W4 A1 EncoderLayer arguments for self.mha Sequence Models	4	580	May 18, 2023

WEEK 4 Stuck in the final transformer asignment

Related topics