Can someone help me understand why changing the number of heads does not affect the output dimension? As far as I know, the lecture states that the heads are concatenated, which should impact the output dimension. Thank you in advance!
The key is that each head does not operate on the full model dimension. With model dimension d_model and h heads, each head projects q, k, and v down to dimension d_k = d_model / h, computes attention in that smaller space, and produces an output of size d_model / h. Concatenating the h head outputs then gives h * (d_model / h) = d_model, so the concatenated result is always d_model wide no matter how many heads you use. A final output projection W_O (d_model x d_model) mixes the heads together. Changing h just changes how the d_model dimensions are split among heads, not the total output dimension.
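A quick way to see this empirically, a minimal sketch using PyTorch's `nn.MultiheadAttention` (assuming PyTorch here since the lecture framework wasn't specified; the dimensions are made up for illustration):

```python
import torch
import torch.nn as nn

d_model = 64
x = torch.randn(2, 10, d_model)  # (batch, seq_len, d_model)

for num_heads in (1, 4, 8):
    # embed_dim must be divisible by num_heads; each head gets d_model // num_heads dims
    attn = nn.MultiheadAttention(embed_dim=d_model, num_heads=num_heads, batch_first=True)
    out, _ = attn(x, x, x)  # self-attention: query = key = value = x
    print(num_heads, out.shape)  # shape is (2, 10, 64) for every head count
```

Each run prints the same output shape because the per-head dimension shrinks as the head count grows, so the concatenation always adds back up to d_model.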