W4 A1 number of heads vs

The concatenation of the heads is multiplied by the output weight. If you look at what happens when head(i) changes, the attention weights are applied across all 3 multi-head attention inputs (q, k and v), so each head selects its own relative q, k and v values, and the final output has the same form as the head(i) output.
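
To make the concatenation step concrete, here is a minimal NumPy sketch of multi-head attention. It is not the assignment's code: the function name multi_head_attention, the weight names Wq, Wk, Wv, Wo and the toy sizes are assumptions made only for illustration. It shows that each head works on its own slice of the projected q, k and v, and that the heads are concatenated and multiplied by the output weight.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(q, k, v, Wq, Wk, Wv, Wo, num_heads):
    # Hypothetical helper: q, k, v have shape (seq_len, d_model),
    # and d_model is assumed to be divisible by num_heads.
    seq_len, d_model = q.shape
    depth = d_model // num_heads

    def split(x):
        # (seq_len, d_model) -> (num_heads, seq_len, depth)
        return x.reshape(x.shape[0], num_heads, depth).transpose(1, 0, 2)

    # Project, then give each head its own slice of q, k and v.
    Q, K, V = split(q @ Wq), split(k @ Wk), split(v @ Wv)

    # Scaled dot-product attention, computed independently per head.
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(depth)
    weights = softmax(scores, axis=-1)   # (num_heads, seq_len, seq_len)
    heads = weights @ V                  # (num_heads, seq_len, depth)

    # Concatenate the heads back to (seq_len, d_model), then apply Wo.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ Wo, weights

# Toy data (assumed sizes): 4 tokens, model width 8.
rng = np.random.default_rng(0)
seq_len, d_model = 4, 8
x = rng.normal(size=(seq_len, d_model))
Wq, Wk, Wv, Wo = (rng.normal(size=(d_model, d_model)) for _ in range(4))

for h in (1, 2, 4):
    out, w = multi_head_attention(x, x, x, Wq, Wk, Wv, Wo, h)
    print(h, out.shape, w.shape)
```

In this sketch, changing num_heads changes how many attention weight matrices there are and how wide each head is, but the concatenated output keeps the shape (seq_len, d_model).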