C5W4 Query analogy for weight matrices

In the videos on self-attention and multi-head attention, Andrew uses an analogy to help us understand W^Q and the other weight matrices W^K and W^V.

He says that W^Q asks a query, such as “what’s happening there?”. But really, W^Q has one row per word in the vocabulary, right? (as it is multiplied with the one-hot vector x) So, doesn’t that mean that W^Q can ask a completely different query for each word? (edit: perhaps I’m wrong and x already represents an embedding. So then I suppose the queries have as many degrees of freedom as the number of word embedding dimensions. Something like that… I don’t think it changes my question.)

If so, isn’t it a bit misleading to say that, in multi-head attention, W^Q_1 represents “what’s happening” and W^Q_2 represents “when”? And when @reinoudbosch writes here that every matrix W^Q represents its own universe of meanings:

Does that really make sense? Why would the queries for the different words be sorted into meaningful sets? I.e. isn’t each matrix just a random bundle of queries, one per word, where multi-head attention just buys us the fact that we get to have more than one query per word?

Am I missing something? Thanks!

Clara

First of all, I recommend going through this thread.
I think that, due to the limited video length, Andrew’s talk focuses on “intuition” so that the majority of the audience gets some sense of the Transformer; it is not meant to cover the whole Transformer technology.

Thanks! I’m still not sure whether I’m right or wrong about the specific problem with this intuition, though.

Does it really make sense to say, as Andrew does, that the whole of W^Q can be thought of as just one “query” that we ask about each word? W^Q itself does not vary by word or position, and why would we want to ask “what’s happening there?” about “Jane”?

And if instead I’m right to say that each W^Q is just a bundle of queries, one per word, then there’s no reason for those queries to have any common meaning across words within heads, right? Except I guess for the fact that all the rows in W^Q ultimately have to work with the same v<1>, v<2>,… as their answer/value. So maybe in different heads, we’ll get very different value matrices, some of which specialize in answering particular types of queries.

Does this all make sense?

I think the link covers most of this, but let me elaborate on the key points along with your question.
I think the word “query” is misleading in the Transformer, but that is the given definition. So we need to start with that.
As I wrote in this thread, Q is the “target” and K/V are the “source”, and attention creates a mapping between them. The mapping can be a translation between languages or the relationships of words within a single language. The term “query” is slightly misleading, since it is not a real query like SQL. :slight_smile:

the whole of W^Q can be thought of as just one “query” that we ask about each word?

As you can see, Q (or rather the product QW^{Q}) consists of multiple word vectors, each a mixture of “word embedding” and “position encoding”, and is dispatched to multiple attention heads. The important thing is not one “query” for the whole sentence; rather, each word vector, like q^{<1>}W_1^{Q}, is associated with the words in a source list, like k^{<1>}W_1^{K}, k^{<2>}W_1^{K}, k^{<3>}W_1^{K}, ..., and those associations are what we call “attention weights”.
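
To make that concrete, here is a minimal NumPy sketch of what one head does for a single word: the query vector (playing the role of q^{<1>}W_1^{Q}) is compared with every key vector by a scaled dot product, and a softmax turns the scores into attention weights. The sizes and random values are made up for illustration.

```python
import numpy as np

def attention_weights(q, K, d_k):
    """Scaled dot-product attention weights for one query vector q
    against all key vectors in K (one row per source word)."""
    scores = K @ q / np.sqrt(d_k)           # one score per source word
    scores = np.exp(scores - scores.max())  # softmax, numerically stable
    return scores / scores.sum()

d_k = 4                                     # illustrative head dimension
rng = np.random.default_rng(0)
q1 = rng.normal(size=d_k)                   # stands in for q^<1> W_1^Q
K  = rng.normal(size=(3, d_k))              # stands in for k^<i> W_1^K, 3 source words

w = attention_weights(q1, K, d_k)
print(w, w.sum())                           # three weights, one per source word; they sum to 1
```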

W^Q itself does not vary by word or position, and why would we want to ask “what’s happening there?” about “Jane”?

It is not clear to me whether your W^Q means the matrix W^{Q} or the product of Q and W^{Q}, but either way we should look at each word, as in the previous question.
Each query entry, i.e., each word vector, includes “word embedding” and “position encoding”. So the “similarity” to other words can be calculated with a dot product or cosine similarity. As you learned in seq-to-seq translation, the word embedding is the key vector for representing a word’s meaning. In addition, we need to consider the position information, which is typically lost in the older RNN approach. With this, the Transformer can learn that the first word, “Jane”, in Q is associated with the second word, “visited”, in K. There is no “what’s happening there?” type of query in here. That’s Andrew’s “intuition” talk.
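
To make the “word embedding + position encoding” input concrete, here is a minimal sketch using the sinusoidal encoding from the paper; the 3-word sentence and the sizes are just illustrative.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal position encoding from 'Attention Is All You Need'."""
    pos = np.arange(seq_len)[:, None]                   # (seq_len, 1)
    i   = np.arange(d_model)[None, :]                   # (1, d_model)
    angles = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])               # even dimensions: sin
    pe[:, 1::2] = np.cos(angles[:, 1::2])               # odd dimensions: cos
    return pe

d_model = 8
embeddings = np.random.default_rng(1).normal(size=(3, d_model))  # e.g. "Jane visited Africa"
x = embeddings + positional_encoding(3, d_model)        # the vectors that Q, K, V are computed from
```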

each W^Q is just a bundle of queries, one per word, then there’s no reason for those queries to have any common meaning across words within heads, right?

QW^{Q} includes multiple words, each consisting of “word embedding” plus “position encoding”. Each head receives part of each word vector, q^{<i>}W_h^{Q}, which may correspond to part of the word embedding or part of the position encoding, but it receives all of the words. So each head can create associations among the words in Q (and K).
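
That matches the way it is commonly implemented: the full projection QW^{Q} is computed once and then reshaped so that each head gets a depth-sized slice of every word’s projected vector while still seeing all word positions. A rough sketch with made-up sizes:

```python
import numpy as np

def split_heads(x, num_heads):
    """Reshape (seq_len, d_model) into (num_heads, seq_len, depth):
    every head gets a depth-sized slice of each word's projected vector,
    but all heads see every word position."""
    seq_len, d_model = x.shape
    depth = d_model // num_heads
    return x.reshape(seq_len, num_heads, depth).transpose(1, 0, 2)

d_model, num_heads = 8, 2
XWq = np.random.default_rng(2).normal(size=(3, d_model))  # projected queries for 3 words
q_heads = split_heads(XWq, num_heads)
print(q_heads.shape)   # (2, 3, 4): 2 heads, each with all 3 words at depth 4
```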

all the rows in W^Q ultimately have to work with the same v<1>, v<2>,… as their answer/value.

If that is about self-attention: yes, we create associations among the words in Q and K, but all of them are weighted, and the weights are learnable.

So maybe in different heads, we’ll get very different value matrices, some of which specialize in answering particular types of queries.

Each head has slightly different attention weights, since, say, the first head receives part of the word embedding and the second head receives part of the position encoding. But all of them have a mapping over all the words. You can see examples at Course 5 Week 4 Assignment: Why are attention weights returned in DecoderLayer.
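
If you want to inspect those per-head weights yourself, tf.keras.layers.MultiHeadAttention can hand them back; here is a minimal sketch (the sequence lengths and sizes are made up, and this standalone layer is not the exact one from the assignment):

```python
import tensorflow as tf

mha = tf.keras.layers.MultiHeadAttention(num_heads=8, key_dim=64)

q  = tf.random.normal((1, 5, 512))   # (batch, target_len, d_model), decoder side
kv = tf.random.normal((1, 7, 512))   # (batch, source_len, d_model), encoder side

out, attn = mha(query=q, value=kv, key=kv, return_attention_scores=True)
print(attn.shape)  # (1, 8, 5, 7): one (target x source) weight matrix per head
```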

When I took this course a few years ago, there was no Transformer chapter. So my starting point was the paper. I believe you went through it, but if not, please take a look. It is a paper, but the content does not read like a classic paper; it is more like a web article. Easy to read.

Then, I need a cup of coffee… :tea:

Hi, and thanks for the great questions claravdw!

Here’s my two cents.

The way Andrew presents the queries could be seen as a simplification made to keep transformers understandable for learners for whom ‘meaning features’ or ‘sets of meaning features’ may be too abstract. I think Nobu_Asai provides a great description of (part of) the forward pass in his last answer, but I think it’s important not to forget that there’s also a backward pass, through which sets of meaning features derived from the (desired) outputs are passed backward to the parameters determining queries, keys, and values. Because multiple multi-headed attention layers are used, abstract representations of meaning come into being within the overall transformer that, if one wishes, may be referred to as universes of meaning. Splitting attention heads keeps such universes of meaning separate to some extent, which empirically has been seen to improve the effectiveness of the model.

Does this make sense to you?

Thank you both for your clarifications. I read the Attention Is All You Need paper, and I came out thinking that this has something to do with the projection step: before we apply the attention step, we also project the key, query and value matrices down to lower-dimensional versions (by multiplication with learned weight matrices).

So maybe for each head, we end up with a projection that represents one particular meaning aspect. I’m not 100% clear on the math, but that is my intuition on why different heads (each with their own query, key and value projections) represent different meanings.
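
For reference, this is the projection step as the paper writes it (base model sizes: d_model = 512, h = 8, d_k = d_v = 64):

```latex
\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^O,
\qquad
\mathrm{head}_i = \mathrm{Attention}(Q W_i^Q,\; K W_i^K,\; V W_i^V)
```

with W_i^Q ∈ R^{d_model × d_k}, W_i^K ∈ R^{d_model × d_k}, W_i^V ∈ R^{d_model × d_v}, and W^O ∈ R^{h·d_v × d_model}. So each head attends using its own lower-dimensional projections of Q, K, and V.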

I think you are catching good points.

As a computer engineer/programmer, though, what I may not like is this…

So maybe for each head, we end up with a projection that represents one particular meaning aspect.

A computer does not know what a word means, but it does record, as weights, which word appears in which position. From “intuition”, some may call that “meaning”, but I can only say the above.
You can see this if you look at the weights extracted from the 8-head MHA in the decoder, i.e., the weights that show the relationship between K/V from the encoder (Portuguese in this case) and Q from the decoder (English in this case).

Here is an example.

In this example, there are 8 heads. Each has slightly different weights (attentions) among the words. Softmax is applied per head, and then all the head outputs are concatenated. This is a reasonable approach for covering a longer sentence from different viewpoints. I cannot say that is a “meaning” aspect, but I would say it is an “association/relationship” aspect. :slight_smile:
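
For completeness, here is a compact NumPy sketch of that flow with made-up sizes: a softmax per head gives 8 separate attention-weight matrices, each head produces its own output, and the head outputs are concatenated and passed through the output projection W^O.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(3)
h, seq, d_k, d_model = 8, 5, 8, 64          # 8 heads, 5 words (illustrative sizes)

Q = rng.normal(size=(h, seq, d_k))          # per-head projected queries
K = rng.normal(size=(h, seq, d_k))          # per-head projected keys
V = rng.normal(size=(h, seq, d_k))          # per-head projected values
Wo = rng.normal(size=(h * d_k, d_model))    # output projection W^O

attn = softmax(Q @ K.transpose(0, 2, 1) / np.sqrt(d_k))   # (8, 5, 5): softmax per head
heads = attn @ V                                           # (8, 5, d_k): each head's output
out = heads.transpose(1, 0, 2).reshape(seq, h * d_k) @ Wo  # concatenate heads, then W^O
print(attn.shape, out.shape)                               # (8, 5, 5) (5, 64)
```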

Hi claravdw and Nobu_Asai,

Thanks for a really interesting discussion. I feel that reacting to each other in a thread will help to clarify parts of the workings of the transformer architecture but may fall short of a comprehensive understanding.

How would you feel about trying to attain a comprehensive understanding of the transformer architecture from a philosophical, conceptual, and mathematical perspective, using a Google Doc as a basis? We could then work towards a publishable document that could be linked, one way or another, to the deeplearning.ai community. I think there’s a lot we could learn from this process, and a publishable document could be very useful to other learners.

If you are interested, let me know.

Hey @reinoudbosch, I think that would be a really helpful exercise. Unfortunately, I don’t have the bandwidth for it at the moment. :confused: I’m sorry!

Hi claravdw,

Thanks for your reply. It’s a pity you don’t have the time, but fully understandable of course. In any case, thanks for asking the right kind of questions about the transformer architecture.

I think there is a lot to gain from trying to achieve a fuller understanding. This could also include a more comprehensive understanding of the nature of intelligence, which in some transformer-based architectures is already leading to the use of additional components such as memory and control units, and of interactive components that keep humans and the internet/intranet in the loop. Fascinating stuff.

In case someone is still interested in this thread nine months later, and also just for myself… I’ve taken a deeper look into this, and here is a very brief summary of what I’ve concluded about my specific question. I could be wrong, of course; all I’ve read is the Attention Is All You Need paper plus every tutorial and chapter on transformers I could find, but no further academic papers. :slight_smile:

  • Yes, the x vectors (word representations) will typically be embeddings and not one-hot vectors. No, it doesn’t change my question. In fact, I found it helpful to think about what the meaning of W^Q and W^K would be if the vectors were in fact one-hot vectors, because the problem is the same.
  • Every W^Q is just a set of “queries” that you can ask (one query per row), and multiplying a word vector with that W^Q results in a weighted sum of those queries (see the sketch after this list). So multiplying the embedding of “Africa” with W^Q_1 might result in a mix of the queries (rows of W^Q_1) “what happens there?” and “why do people go there?”, whereas multiplying the embedding of “Africa” with W^Q_2 might result in a mix of the queries (rows of W^Q_2) “who lives there?” and “who goes there?”.
  • It makes sense to say that every W^Q represents its own universe of meanings only to the extent that every W^Q contains a series of queries “which would make sense to sum together given certain embeddings”. That is a very loose definition of “sharing meaning”. I don’t find it a super helpful metaphor; I’d rather think of W^Q as a list of queries, where having more W^Q’s just means we get several lists of queries.
  • I don’t think the projection step has anything to do with this, so I was wrong when I proposed that.
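
A tiny NumPy sketch of that second point (the sizes and the random “embedding” are purely illustrative): a one-hot x picks out a single row of W^Q, while a dense embedding gives a weighted sum of all rows.

```python
import numpy as np

rng = np.random.default_rng(4)
d, d_k = 6, 3                       # embedding size, query size (illustrative)
Wq = rng.normal(size=(d, d_k))      # each row of W^Q is one "query" direction

x_onehot = np.zeros(d)
x_onehot[2] = 1.0                   # one-hot vector for "word 2"
x_embed = rng.normal(size=d)        # a dense embedding, e.g. for "Africa"

q_onehot = x_onehot @ Wq            # picks out exactly row 2 of W^Q
q_embed  = x_embed @ Wq             # a weighted sum of all rows of W^Q

# The same thing written out explicitly as a sum over rows:
q_manual = sum(x_embed[i] * Wq[i] for i in range(d))
print(np.allclose(q_embed, q_manual))   # True
```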

Hope this helps someone!