Sequence length, d_head variable

Hello, I am trying to understand the sequence length variable from the assignment

I reviewed the reading on “Multihead Attention”

I am guessing that sequence length is how many words are in a sentence. If I use the picture from the reading, that would be n, where I have word1, word2, …, word n.

I am also not quite sure about the d_head variable in my coding picture (1st pic). Is it the number of columns of Z in the picture below, which largely depends on Wo?

Hi @Fei_Li

Yes. In the picture the sequence length is 2; in a more realistic scenario it would be padded (to something like 32).

Yes, d_head highly depends on Wo, or more generally on the embedding dimension: if the embedding dimension is 32 and there are 4 heads, then d_head would be 8.

Each head “operates” on its own vector of 8 numbers; then these per-head outputs are concatenated back into a vector of 32. The subsequent linear transformation Wo is applied to it to get Z.
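
If it helps, here is a minimal NumPy sketch of that dimensional bookkeeping (not the assignment’s code; the names and random values are just for illustration): split a 32-number vector per position into 4 heads of 8, concatenate them back, and apply Wo.

import numpy as np

seq_len, d_model, n_heads = 2, 32, 4
d_head = d_model // n_heads                # 32 / 4 = 8

x = np.random.randn(seq_len, d_model)      # one 32-number vector per position
heads = np.split(x, n_heads, axis=-1)      # 4 pieces of shape (2, 8), one per head
concat = np.concatenate(heads, axis=-1)    # back to shape (2, 32)

W_o = np.random.randn(d_model, d_model)    # the final linear transformation
Z = concat @ W_o                           # Z has shape (2, 32)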


Sure, thank you. Based on your explanation, I also want to make sure that I understand “d_model” in multi-head attention. Appreciate your help.

In the picture [Z1 Z2 … Zn] * Wo = Z, there are n heads, right? Then, say we have a sequence of words; we apply a different set of WQ, WK, and WV to get each head.

So if I use your example, for each word we use 4 heads, and for each head we use 8 numbers, which is d_head. This 8 should be d_model if we don’t use multi-head attention, right? That is, we use 8 numbers to represent a word before the multi-head mechanism.
But now, since we are using the multi-head mechanism to represent a word, each word is represented by 32 numbers (our new d_model), which come from 4 heads, each of them having 8 numbers.

If I understand you correctly, then this number (d_model) would be 32. In other words:

  • if the embedding dimension is 32 and we have 1 head, then the head dimension is 32;
  • if the embedding dimension is 32 and we have 4 heads, then the head dimension is 8.

To illustrate what happens with more concrete steps (this time with an embedding dimension of 16):

For example, input is “word1 word2” → tokenize(input) → [2, 54] → Embed([2, 54]) → 2D matrix of shape (2, 16):

[[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15],
 [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15]]

*Note: the values here are just indexes; the real values would look more ordinary (like 0.23, -0.10, etc.).
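
For reference, the tokenize → embed step can be sketched in plain NumPy like this (the embedding table and token ids are made up; the assignment’s real layers do this for you):

import numpy as np

vocab_size, d_model = 100, 16
embedding_table = np.random.randn(vocab_size, d_model)   # one 16-number row per token id

token_ids = np.array([2, 54])            # "word1" -> 2, "word2" -> 54 (arbitrary ids)
embedded = embedding_table[token_ids]    # row lookup -> 2D matrix of shape (2, 16)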

→ get_Q() = Embed([2, 54]) \cdot W_q → Q:

[[0,1,2,3,|4,5,6,7,|8,9,10,11,|12,13,14,15],
 [0,1,2,3,|4,5,6,7,|8,9,10,11,|12,13,14,15]]

→ get_K() = Embed([2, 54]) \cdot W_k → K:

[[0,1,2,3,|4,5,6,7,|8,9,10,11,|12,13,14,15],
 [0,1,2,3,|4,5,6,7,|8,9,10,11,|12,13,14,15]]

→ get_V() = Embed([2, 54]) \cdot W_v → V:

[[0,1,2,3,|4,5,6,7,|8,9,10,11,|12,13,14,15],
 [0,1,2,3,|4,5,6,7,|8,9,10,11,|12,13,14,15]]

*Note: “|” is just for illustration purposes; in reality it would be a simple matrix as before, but later split into a 3D tensor (one piece per head, according to these “|”). In this week’s assignment, these splits (crazy :slight_smile: reshapes and transposes) happen in compute_attention_heads_closure. So “|” indicates each head’s “working” dimensions or splits (4 numbers here, 8 in the previous post, 3 in your pictures).
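
A rough, self-contained NumPy sketch of these projections and splits (not the actual compute_attention_heads_closure code; the weights are random and the shapes are the only point):

import numpy as np

seq_len, d_model, n_heads = 2, 16, 4
d_head = d_model // n_heads                     # 4

embedded = np.random.randn(seq_len, d_model)    # stands in for the (2, 16) embedding above
W_q = np.random.randn(d_model, d_model)
W_k = np.random.randn(d_model, d_model)
W_v = np.random.randn(d_model, d_model)

Q = embedded @ W_q                              # (2, 16)
K = embedded @ W_k                              # (2, 16)
V = embedded @ W_v                              # (2, 16)

def split_heads(m):
    # (seq_len, d_model) -> (n_heads, seq_len, d_head): the "|" splits above
    return m.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)

Qh, Kh, Vh = split_heads(Q), split_heads(K), split_heads(V)   # each has shape (4, 2, 4)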

Now, when you get the attention weights (by softmax(Q \cdot K^T)), you can represent them as a 3D tensor:

[
 [[0,1],  # Q_0_0 @ K_0_0^T, Q_0_0 @ K_1_0^T
  [0,1]], # Q_1_0 @ K_0_0^T, Q_1_0 @ K_1_0^T

 [[0,1],
  [0,1]], 

 [[0,1],
  [0,1]], 

 [[0,1],
  [0,1]], 
]
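
In plain NumPy this step looks roughly like the snippet below (self-contained, with random per-head Q and K just to show the shapes; standard scaled dot-product attention also divides by sqrt(d_head)):

import numpy as np

n_heads, seq_len, d_head = 4, 2, 4
Qh = np.random.randn(n_heads, seq_len, d_head)   # per-head queries (from the split above)
Kh = np.random.randn(n_heads, seq_len, d_head)   # per-head keys

scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_head)         # (4, 2, 2)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)                # row-wise softmax -> (4, 2, 2)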

When you apply this attention to V (attention \cdot V), the outputs are Z_1, Z_2, Z_3, Z_4, which after concatenation (again “crazy” reshapes and transposes, this time in compute_attention_output_closure) become a 2D matrix of shape (2, 16) (like in your bottom picture). Also note that the order of these attention dimensions does not match the Assignment; it is just for illustration.
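
And the last step, sketched the same way (again plain NumPy with stand-in values, not the assignment’s compute_attention_output_closure):

import numpy as np

n_heads, seq_len, d_head = 4, 2, 4
d_model = n_heads * d_head                                    # 16
weights = np.random.rand(n_heads, seq_len, seq_len)           # stand-in attention weights from the softmax step
Vh = np.random.randn(n_heads, seq_len, d_head)                # per-head values

Zh = weights @ Vh                                             # (4, 2, 4): Z_1 .. Z_4, one per head
Z_concat = Zh.transpose(1, 0, 2).reshape(seq_len, d_model)    # concatenate heads -> (2, 16)
W_o = np.random.randn(d_model, d_model)
Z = Z_concat @ W_o                                            # final output, shape (2, 16)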

Thank you very much for your detailed explanation.

I guess I am getting there.

Then this time, d_model is 16.

And this “54” is how many numbers we use to represent a word, which we call the embedding size?

I am getting there, right? :thinking:

Yes

No, this 54 is just a random integer value that I picked for “word2”. In other words, tokenize(“word2”) → 54 and tokenize(“word1”) → 2.

I think so :wink:

Oh, okay okay.

Thank you very very much for your help. :+1: