I got a bit stuck on the following assignment in week 4, specifically on the part that defines the Encoder:

I was convinced that Q, V and K vectors/matrices share the same dimension. What does S represent in that case?

I see that the input x is passed in, which I assumed is the sentence in the form of a list of embeddings, so I assume its dimensions are (batch_size, max_sentence_len, source_embedding_dim).

However, MHA expects K, Q and V matrices, and I do not understand how to convert them, since I do not see the matrices Wq, Wk and Wv available anywhere. Am I overlooking something?

Also, I am a bit puzzled by the sentence “Remember that to compute self-attention Q, V and K should be the same”.

I understand that MHA is just repeated self-attention (with different W matrices), but wouldn’t there in that case be just one triple (Wk, Wv, Wq) instead of several, i.e. one for each “attention head”?
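A minimal NumPy sketch may help separate the two things being mixed up here: the tensors passed into the layer (the same x three times for self-attention) and the per-head projection matrices that the layer holds internally as its own weights. `ToyMultiHeadAttention` and all of its attribute names are hypothetical and only illustrate the idea; this is not the Keras implementation, and the weights are random rather than trained.

```python
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable softmax.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

class ToyMultiHeadAttention:
    """Hypothetical stand-in for an MHA layer (illustration only)."""

    def __init__(self, num_heads, dim, key_dim, seed=0):
        rng = np.random.default_rng(seed)
        # One (Wq, Wk, Wv) triple per head, owned by the layer itself.
        self.Wq = 0.1 * rng.normal(size=(num_heads, dim, key_dim))
        self.Wk = 0.1 * rng.normal(size=(num_heads, dim, key_dim))
        self.Wv = 0.1 * rng.normal(size=(num_heads, dim, key_dim))
        # Output projection that mixes the concatenated heads back to `dim`.
        self.Wo = 0.1 * rng.normal(size=(num_heads * key_dim, dim))

    def __call__(self, query, value, key=None):
        if key is None:
            key = value  # same convention as in the quoted docs
        # Project the inputs separately for every head.
        Q = np.einsum("btd,hdk->bhtk", query, self.Wq)
        K = np.einsum("bsd,hdk->bhsk", key, self.Wk)
        V = np.einsum("bsd,hdk->bhsk", value, self.Wv)
        # Scaled dot-product attention per head.
        scores = np.einsum("bhtk,bhsk->bhts", Q, K) / np.sqrt(Q.shape[-1])
        weights = softmax(scores, axis=-1)            # (B, H, T, S)
        heads = np.einsum("bhts,bhsk->bhtk", weights, V)
        # Concatenate the heads and project back to the model dimension.
        B, H, T, Kd = heads.shape
        concat = heads.transpose(0, 2, 1, 3).reshape(B, T, H * Kd)
        return concat @ self.Wo, weights

# Self-attention: the *inputs* are the same tensor x, yet each head
# still computes its own Q, K and V from it.
x = np.random.default_rng(1).normal(size=(2, 5, 8))  # (B, T, dim)
mha = ToyMultiHeadAttention(num_heads=4, dim=8, key_dim=2)
out, weights = mha(query=x, value=x, key=x)
print(out.shape, weights.shape)  # (2, 5, 8) (2, 4, 5, 5)
```

Read this way, “Q, V and K should be the same” would refer only to the arguments of the call, while there is still one (Wq, Wk, Wv) triple per head inside the layer.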

Also I am a bit confused by the dimensions:

query: Query Tensor of shape (B, T, dim).

value: Value Tensor of shape (B, S, dim).

key: Optional key Tensor of shape (B, S, dim). If not given, will use value for both key and value, which is the most common case.
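To make the roles of T and S concrete: T is the length of the sequence that asks the questions (the query), and S is the length of the sequence being attended to (key and value). The sketch below is a single-head, projection-free illustration in NumPy; the function `attention` and the example sizes are assumptions made for this example, not part of the assignment.

```python
import numpy as np

def attention(query, value, key=None):
    # Mirrors the documented convention: a missing key falls back to value.
    if key is None:
        key = value
    # (B, T, dim) @ (B, dim, S) -> (B, T, S)
    scores = query @ key.transpose(0, 2, 1) / np.sqrt(query.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    # (B, T, S) @ (B, S, dim) -> (B, T, dim)
    return weights @ value, weights

B, T, S, dim = 2, 3, 7, 4
rng = np.random.default_rng(0)

# Cross-attention: queries and values come from different sequences.
q = rng.normal(size=(B, T, dim))
v = rng.normal(size=(B, S, dim))
out, w = attention(q, v)
print(out.shape, w.shape)  # (2, 3, 4) (2, 3, 7)

# Self-attention: one tensor plays all three roles, so T == S.
x = rng.normal(size=(B, T, dim))
self_out, self_w = attention(x, x, x)
print(self_w.shape)  # (2, 3, 3)
```

The output always has the query's length T, and each of the T rows of weights spreads over the S source positions. In an encoder, where everything comes from the same x, the two lengths coincide.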

So x isn’t a list of word embeddings, but rather a single matrix that serves as K, Q and V at once and is precomputed from the embeddings somewhere earlier?

And the same x is later added to the normalized output of the self-attention?
I try to imagine it like this: self-attention provides a slight change in the value of X, and that small “delta” is added to the original (absolute) value of the vector X.

Here’s another intuition that is probably also incorrect. Maybe one can look at self-attention as computing the similarity or correlation between all the words in the ‘x’ matrix.

Personally, I don’t find it hard to find an intuition behind the concepts of the Q, K and V matrices. 3b1b has a fantastic video on attention, for those who seek the intuition behind it.

How I see it: it basically computes the relevance one word has in the context of another, based on the vector similarity between the corresponding Q and K, then scales the vector V by this amount and enriches the original vector with this additional semantic meaning. Multi-head attention repeats this a number of times with different parameters, so different Q, K and V are computed for the same input sequence. This allows it to find the different contexts in which two or more words can relate, because there is never just one.

All of this relies on the fact that these vectors reside in a sufficiently high dimensional space.

I fail to understand why the Keras implementation documentation claims that the key, query and value matrices all have to be set to the same matrix in the case of self-attention.

It goes against what I read and watched about attention as a concept.
(If I understood well, MHA is just repeated self-attention with different Q, K, V triples.)

I try to imagine it like this: self-attention provides a slight change in the value of X, and that small “delta” is added to the original (absolute) value of the vector X.

What I meant when I referred to this is the ADD & NORM part below, where the result of the attention is added to the original vector.

The way I interpreted this is that an attention layer adds a change to the vector X, adding a small delta value that is derived from the context of the other words in the input sequence, rather than directly outputting the modified vector itself. Perhaps this is computationally easier for some reason.
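The “delta” reading can be sketched in a few lines of NumPy. This assumes the arrangement of the original Transformer encoder, where the attention output is first added to x and the sum is then layer-normalized. The attention output is replaced by a small random stand-in here, and `layer_norm` is a simplified helper without the learned scale and offset that a real layer would have.

```python
import numpy as np

def layer_norm(h, eps=1e-6):
    # Normalize each position's feature vector to zero mean, unit variance.
    # Simplification: no learned gain or bias parameters.
    mean = h.mean(axis=-1, keepdims=True)
    var = h.var(axis=-1, keepdims=True)
    return (h - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 5, 8))              # (B, T, dim) input to the block
attn_out = 0.1 * rng.normal(size=x.shape)   # stand-in for the attention output

# Add & Norm: the sublayer contributes a correction on top of x
# (the residual connection); the sum is then normalized.
y = layer_norm(x + attn_out)
print(y.shape)  # (2, 5, 8)
```

Because the sublayer only has to produce the correction, x itself passes through the addition unchanged, which is the usual motivation given for residual connections in deep networks.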