Sequence length, d_head variable

Hello, I am trying to understand the sequence length variable from the assignment

I reviewed the reading on “Multihead Attention”

I am guessing that sequence length is how many words are in a sentence. If I use the picture from the reading, that would be n, where I have word1, word2, …, word n.

I am also not quite sure about the d_head variable in my coding picture (1st pic). Is it the number of columns of Z in the picture below, which largely depends on Wo?

Hi @Fei_Li

Yes. In the picture the sequence length is 2; in a more realistic scenario it would be padded (to something like 32).

Yes, d_head highly depends on Wo, or more generally on the embedding dimension: if the embedding dimension is 32 and there are 4 heads, then d_head would be 8.

Each head “operates” on its own vector of 8 numbers; then these per-head outputs are concatenated back into a vector of 32. The subsequent linear transformation Wo is applied to it to get Z.
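
If it helps, here is a minimal NumPy sketch of that dimensional bookkeeping (not the assignment’s code; the names and random values are just for illustration): split a 32-number vector per position into 4 heads of 8, concatenate them back, and apply Wo.

import numpy as np

seq_len, d_model, n_heads = 2, 32, 4
d_head = d_model // n_heads                # 32 / 4 = 8

x = np.random.randn(seq_len, d_model)      # one 32-number vector per position
heads = np.split(x, n_heads, axis=-1)      # 4 pieces of shape (2, 8), one per head
concat = np.concatenate(heads, axis=-1)    # back to shape (2, 32)

W_o = np.random.randn(d_model, d_model)    # the final linear transformation
Z = concat @ W_o                           # Z has shape (2, 32)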


Sure, thank you. Based on your explanation, I also want to make sure that I understand “d_model” in multi-head attention. Appreciate your help.

In the picture [Z1 Z2 … Zn] * Wo = Z, there are n heads, right? Then, say we have a sequence of words; we apply a different set of WQ, WK, and WV to get each head.

So if I use your example, for each word we use 4 heads, and for each head we use 8 numbers, which is d_head. This 8 should be d_model if we don’t use multi-head attention, right? That is, we use 8 numbers to represent a word before the multi-head mechanism.
But now, since we are using the multi-head mechanism to represent a word, each word is represented by 32 numbers (our new d_model), which come from 4 heads, each of them having 8 numbers.

If I understand you correctly, then this number (d_model) would be 32. In other words:

  • if the embedding dimension is 32 and we have 1 head, then the head dimension is 32;
  • if the embedding dimension is 32 and we have 4 heads, then the head dimension is 8.

To illustrate what happens with more concrete steps (this time with an embedding dimension of 16):

For example, input is “word1 word2” → tokenize(input) → [2, 54] → Embed([2, 54]) → 2D matrix of shape (2, 16):

[[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15],
 [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15]]

*Note: the values here are just indexes; the real values would look more ordinary (like 0.23, -0.10, etc.).
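
For reference, the tokenize → embed step can be sketched in plain NumPy like this (the embedding table and token ids are made up; the assignment’s real layers do this for you):

import numpy as np

vocab_size, d_model = 100, 16
embedding_table = np.random.randn(vocab_size, d_model)   # one 16-number row per token id

token_ids = np.array([2, 54])            # "word1" -> 2, "word2" -> 54 (arbitrary ids)
embedded = embedding_table[token_ids]    # row lookup -> 2D matrix of shape (2, 16)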

→ get_Q() = Embed([2, 54]) \cdot W_q → Q:

[[0,1,2,3,|4,5,6,7,|8,9,10,11,|12,13,14,15],
 [0,1,2,3,|4,5,6,7,|8,9,10,11,|12,13,14,15]]

→ get_K() = Embed([2, 54]) \cdot W_k → K:

[[0,1,2,3,|4,5,6,7,|8,9,10,11,|12,13,14,15],
 [0,1,2,3,|4,5,6,7,|8,9,10,11,|12,13,14,15]]

→ get_V() = Embed([2, 54]) \cdot W_v → V:

[[0,1,2,3,|4,5,6,7,|8,9,10,11,|12,13,14,15],
 [0,1,2,3,|4,5,6,7,|8,9,10,11,|12,13,14,15]]

*Note: “|” is just for illustration purposes; in reality it would be a simple matrix as before, but later split into a 3D tensor (one piece per head, according to these “|”). In this week’s assignment, these splits (crazy :slight_smile: reshapes and transposes) happen in compute_attention_heads_closure. So “|” indicates each head’s “working” dimensions or splits (4 numbers here, 8 in the previous post, 3 in your pictures).
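
A rough, self-contained NumPy sketch of these projections and splits (not the actual compute_attention_heads_closure code; the weights are random and the shapes are the only point):

import numpy as np

seq_len, d_model, n_heads = 2, 16, 4
d_head = d_model // n_heads                     # 4

embedded = np.random.randn(seq_len, d_model)    # stands in for the (2, 16) embedding above
W_q = np.random.randn(d_model, d_model)
W_k = np.random.randn(d_model, d_model)
W_v = np.random.randn(d_model, d_model)

Q = embedded @ W_q                              # (2, 16)
K = embedded @ W_k                              # (2, 16)
V = embedded @ W_v                              # (2, 16)

def split_heads(m):
    # (seq_len, d_model) -> (n_heads, seq_len, d_head): the "|" splits above
    return m.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)

Qh, Kh, Vh = split_heads(Q), split_heads(K), split_heads(V)   # each has shape (4, 2, 4)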

Now, when you get the attention weights (by softmax(Q \cdot K^T)), you can represent them as a 3D tensor:

[
 [[0,1],  # Q_0_0 @ K_0_0^T, Q_0_0 @ K_1_0^T
  [0,1]], # Q_1_0 @ K_0_0^T, Q_1_0 @ K_1_0^T

 [[0,1],
  [0,1]], 

 [[0,1],
  [0,1]], 

 [[0,1],
  [0,1]], 
]
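
In plain NumPy this step looks roughly like the snippet below (self-contained, with random per-head Q and K just to show the shapes; standard scaled dot-product attention also divides by sqrt(d_head)):

import numpy as np

n_heads, seq_len, d_head = 4, 2, 4
Qh = np.random.randn(n_heads, seq_len, d_head)   # per-head queries (from the split above)
Kh = np.random.randn(n_heads, seq_len, d_head)   # per-head keys

scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_head)         # (4, 2, 2)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)                # row-wise softmax -> (4, 2, 2)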

When you apply this attention to V (attention \cdot V), the outputs are Z_1, Z_2, Z_3, Z_4, which after concatenation (again “crazy” reshapes and transposes, this time in compute_attention_output_closure) become a 2D matrix of shape (2, 16) (like in your bottom picture). Also note that the order of these attention dimensions does not match the Assignment; it is just for illustration.
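
And the last step, sketched the same way (again plain NumPy with stand-in values, not the assignment’s compute_attention_output_closure):

import numpy as np

n_heads, seq_len, d_head = 4, 2, 4
d_model = n_heads * d_head                                    # 16
weights = np.random.rand(n_heads, seq_len, seq_len)           # stand-in attention weights from the softmax step
Vh = np.random.randn(n_heads, seq_len, d_head)                # per-head values

Zh = weights @ Vh                                             # (4, 2, 4): Z_1 .. Z_4, one per head
Z_concat = Zh.transpose(1, 0, 2).reshape(seq_len, d_model)    # concatenate heads -> (2, 16)
W_o = np.random.randn(d_model, d_model)
Z = Z_concat @ W_o                                            # final output, shape (2, 16)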

Thank you very much for your detailed explanation.

I guess I am getting there.

Then this time, d_model is 16.

And this “54” is how many numbers we use to represent a word, which we call the embedding size?

I am getting there, right? :thinking:

Yes

No, this 54 is just a random integer value that I picked for “word2”. In other words, tokenize(“word2”) → 54 and tokenize(“word1”) → 2.

I think so :wink:

Oh, okay okay.

Thank you very very much for your help. :+1: