If I understand you correctly, then this number would be 32. In other words,
- if the embedding dimension is 32 and we have 1 head, then the head dimension is 32;
- if the embedding dimension is 32 and we have 4 heads, then the head dimension is 8.
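The two cases above are just an integer division. A minimal sketch, where `head_dim` is a hypothetical helper name of mine (not something from the assignment) and I assume the embedding dimension is split evenly across heads:

```python
def head_dim(d_model: int, n_heads: int) -> int:
    """Per-head feature size when d_model is split evenly across heads."""
    if d_model % n_heads != 0:
        raise ValueError("d_model must be divisible by n_heads")
    return d_model // n_heads


print(head_dim(32, 1))  # 32
print(head_dim(32, 4))  # 8
print(head_dim(16, 4))  # 4 (the setup used in the walkthrough)
```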
To illustrate with more concrete steps (this time with embedding dimension 16 and 4 heads), suppose the input is “word1 word2”:
tokenize(input) → [2, 54] → Embed([2, 54]) → 2D matrix of shape (2, 16):
[[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15],
[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15]]
*Note: the values here are just column indexes so you can track positions; real embedding values would look more like small floats (0.23, -0.10, etc.).
→ get_Q()=Embed([2, 54]) \cdot W_q → Q:
[[0,1,2,3,|4,5,6,7,|8,9,10,11,|12,13,14,15],
[0,1,2,3,|4,5,6,7,|8,9,10,11,|12,13,14,15]]
→ get_K()=Embed([2, 54]) \cdot W_k → K:
[[0,1,2,3,|4,5,6,7,|8,9,10,11,|12,13,14,15],
[0,1,2,3,|4,5,6,7,|8,9,10,11,|12,13,14,15]]
→ get_V()=Embed([2, 54]) \cdot W_v → V:
[[0,1,2,3,|4,5,6,7,|8,9,10,11,|12,13,14,15],
[0,1,2,3,|4,5,6,7,|8,9,10,11,|12,13,14,15]]
*Note: “|” is just for illustration. In reality each of Q, K and V is a plain 2D matrix as before, which is later split into a 3D tensor with one slice per head, along these “|” marks. In this week’s assignment these splits (the “crazy” reshapes and transposes) happen in compute_attention_heads_closure. So “|” marks each head’s “working” dimensions (4 numbers here, 8 in the previous post, 3 in your pictures).
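To make the split along “|” concrete, here is a rough NumPy sketch. It is not the assignment's code: I drop the batch dimension, the variable names are mine, and the axis order is chosen only for readability.

```python
import numpy as np

seq_len, d_model, n_heads = 2, 16, 4
d_head = d_model // n_heads  # 4 numbers per head

# Same toy values as in the post: every row is 0..15
Q = np.tile(np.arange(d_model), (seq_len, 1))  # shape (2, 16)

# (seq_len, d_model) -> (seq_len, n_heads, d_head) -> (n_heads, seq_len, d_head)
Q_heads = Q.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)

print(Q_heads.shape)  # (4, 2, 4)
print(Q_heads[1])     # head 1 holds columns 4..7 of both tokens
```

The same reshape and transpose would apply to K and V, so each head only ever works with its own 4 columns.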
Now when you get the attention weights (by softmax(Q \cdot K^T)), you can represent them as a 3D tensor of shape (4, 2, 2), i.e. one 2×2 block per head:
[
[[0,1], # Q_0_0 @ K_1_0^T, Q_0_0 @ K_0_0^T
[0,1]], # Q_1_0 @ K_1_0^T, Q_1_0 @ K_0_0^T
[[0,1],
[0,1]],
[[0,1],
[0,1]],
[[0,1],
[0,1]],
]
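A small sketch of how such a per-head weight tensor could be computed, assuming Q and K have already been split into heads with shape (n_heads, seq_len, d_head). The random values and the `softmax` helper are mine, and I include the usual division by sqrt(d_head) from scaled dot-product attention, which the walkthrough leaves out for brevity:

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


n_heads, seq_len, d_head = 4, 2, 4
rng = np.random.default_rng(0)
Q = rng.normal(size=(n_heads, seq_len, d_head))
K = rng.normal(size=(n_heads, seq_len, d_head))

# Batched matmul: each head gets its own (seq_len, seq_len) score matrix
scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)
weights = softmax(scores, axis=-1)

print(weights.shape)         # (4, 2, 2)
print(weights.sum(axis=-1))  # every row of every head sums to 1
```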
When you apply these attention weights to V (attention \cdot V), the output is Z_1, Z_2, Z_3, Z_4, one per head. After concatenation (again, “crazy” reshapes and transposes, this time in compute_attention_output_closure) they become a 2D matrix of shape (2, 16) (like in your bottom picture). Also note that the order of these attention dimensions does not match the assignment; it is just for illustration.
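The final step (apply the weights to V, then concatenate the heads) might look like this in NumPy. Again this is only a sketch with made-up values and my own variable names, without the batch dimension, and not the assignment's implementation:

```python
import numpy as np

n_heads, seq_len, d_head = 4, 2, 4
rng = np.random.default_rng(0)

# Stand-in attention weights: positive, each row normalized to sum to 1
weights = rng.random((n_heads, seq_len, seq_len))
weights /= weights.sum(axis=-1, keepdims=True)
V = rng.normal(size=(n_heads, seq_len, d_head))

# One output per head, stacked: shape (n_heads, seq_len, d_head)
Z = weights @ V

# Concatenate heads: (n_heads, seq_len, d_head) -> (seq_len, n_heads * d_head)
out = Z.transpose(1, 0, 2).reshape(seq_len, n_heads * d_head)

print(Z.shape)    # (4, 2, 4)
print(out.shape)  # (2, 16)
```

Columns 4..7 of `out` are exactly the output of head 1, so the concatenation puts every head back where its “|” slice came from.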