Can someone help me understand why changing the number of heads does not affect the output dimension? As far as I know, the lecture states that the heads are concatenated, which should impact the output dimension. Thank you in advance! [Screenshots attached]

W4 A1 number of heads vs

Deepti_Prasad December 29, 2024, 4:24am 2

The heads are concatenated, but the concatenated result is then multiplied by the output weight matrix. When the number of heads changes, each head(i) still computes its attention weights from its own q, k and v, the heads are concatenated, and the output weight projects the result back to the same output dimension as before.
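To make the shapes concrete, here is a minimal NumPy sketch of multi-head self-attention. It is not the assignment's code: the function name `multi_head_attention`, the random weights, and the sizes (`seq_len=5`, `d_model=16`, `key_dim=8`) are all made up for illustration, and it assumes a fixed per-head size `key_dim` with a final output projection back to `d_model`.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def multi_head_attention(x, num_heads, key_dim, rng):
    """Self-attention over x of shape (seq_len, d_model) with random weights.

    Returns the shape of the concatenated heads and the shape of the output.
    """
    seq_len, d_model = x.shape

    # Hypothetical per-head projections: each head maps d_model -> key_dim.
    w_q = rng.normal(size=(num_heads, d_model, key_dim))
    w_k = rng.normal(size=(num_heads, d_model, key_dim))
    w_v = rng.normal(size=(num_heads, d_model, key_dim))
    # Output weight: maps the concatenated heads back to d_model.
    w_o = rng.normal(size=(num_heads * key_dim, d_model))

    q = np.einsum("sd,hdk->hsk", x, w_q)  # (heads, seq_len, key_dim)
    k = np.einsum("sd,hdk->hsk", x, w_k)
    v = np.einsum("sd,hdk->hsk", x, w_v)

    scores = q @ k.transpose(0, 2, 1) / np.sqrt(key_dim)  # (heads, seq, seq)
    weights = softmax(scores, axis=-1)
    heads = weights @ v  # (heads, seq_len, key_dim)

    # Concatenate the heads along the feature axis.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, num_heads * key_dim)
    # Project back to the model dimension.
    out = concat @ w_o
    return concat.shape, out.shape


rng = np.random.default_rng(0)
x = rng.normal(size=(5, 16))  # seq_len=5, d_model=16
shapes = {h: multi_head_attention(x, h, key_dim=8, rng=rng) for h in (1, 2, 4)}
for h, (concat_shape, out_shape) in shapes.items():
    print(f"heads={h}: concat {concat_shape} -> output {out_shape}")
```

The concatenated width grows with the number of heads (`num_heads * key_dim`), but the output weight has shape `(num_heads * key_dim, d_model)`, so the final output stays `(seq_len, d_model)` in every case.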
