The concatenation of the heads is multiplied by the output weight. When head(i) changes, the attention weights are applied to all three inputs of the multi-head attention, so the q, k and v values belonging to that head are selected, and the result is the head(i) output.
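To make the description concrete, here is a minimal NumPy sketch of multi-head self-attention. It is only an illustration under assumptions: the names (`attention_head`, `multi_head_attention`, `w_q`, `w_k`, `w_v`, `w_o`), the shapes, and the random inputs are all made up here, and there is no masking, bias, or batching. Each head(i) uses its own q, k and v projections, and the concatenated head outputs are multiplied by the output weight.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_head(q, k, v):
    # Scaled dot-product attention for a single head.
    d_k = q.shape[-1]
    weights = softmax(q @ k.T / np.sqrt(d_k))  # (seq_len, seq_len)
    return weights @ v                         # (seq_len, d_head)

def multi_head_attention(x, w_q, w_k, w_v, w_o):
    # Each head(i) selects its own q, k and v through its own projections.
    heads = [
        attention_head(x @ w_q[i], x @ w_k[i], x @ w_v[i])
        for i in range(len(w_q))
    ]
    # Concatenate the head outputs, then multiply by the output weight.
    return np.concatenate(heads, axis=-1) @ w_o, heads

# Hypothetical sizes and random weights, chosen only for the example.
rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 4, 8, 2
d_head = d_model // n_heads

x = rng.normal(size=(seq_len, d_model))
w_q = rng.normal(size=(n_heads, d_model, d_head))
w_k = rng.normal(size=(n_heads, d_model, d_head))
w_v = rng.normal(size=(n_heads, d_model, d_head))
w_o = rng.normal(size=(n_heads * d_head, d_model))

out, heads = multi_head_attention(x, w_q, w_k, w_v, w_o)
print(out.shape)  # one d_model-sized vector per position
```

One thing this makes visible: multiplying the concatenation by `w_o` is the same as giving each head its own block of rows of `w_o` and summing the results, so changing head(i) only changes the contribution that passes through its block.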