In the Transformer Summary assignment, the dimension of each head is calculated by dividing the feature embedding dimension by the number of heads. In other words, each head operates on a slice of the initial word embedding for q, k, v. Is this the standard method, or is it a specific implementation adopted for this assignment?
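For concreteness, here is a minimal NumPy sketch of the splitting approach as I understand it (the function name and shapes are my own for illustration, not the assignment's actual code):

```python
import numpy as np

def split_heads(x, n_heads):
    """Reshape (batch, seq_len, d_feature) into (batch, n_heads, seq_len, d_head),
    where d_head = d_feature // n_heads. Each head sees only its own slice of the
    feature dimension; no new per-head weights are introduced by this step."""
    batch, seq_len, d_feature = x.shape
    d_head = d_feature // n_heads
    x = x.reshape(batch, seq_len, n_heads, d_head)
    return x.transpose(0, 2, 1, 3)
```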
In the Deep Learning Specialization (course 4, week 4), Andrew Ng explains multi-head attention as several identical, parallel computations, where each head has its own separately learned projections q·W_i^Q, k·W_i^K, v·W_i^V, and NOT just a reshaping of the original embedding followed by concatenation.
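By contrast, the lecture formulation I am referring to looks roughly like the sketch below, where each head has its own projection matrices applied to the full q, k, v (again, my own illustrative code, not the course's):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product attention: softmax(q k^T / sqrt(d_k)) v
    d_k = q.shape[-1]
    return softmax(q @ k.swapaxes(-1, -2) / np.sqrt(d_k)) @ v

def multihead_separate_projections(q, k, v, Wq_list, Wk_list, Wv_list, Wo):
    """head_i = attention(q @ Wq_i, k @ Wk_i, v @ Wv_i); concatenate the heads
    and apply a final output projection Wo. Each W*_i maps d_feature -> d_head,
    so every head is a learned projection of the full embedding rather than a
    fixed slice of it."""
    heads = [attention(q @ Wq, k @ Wk, v @ Wv)
             for Wq, Wk, Wv in zip(Wq_list, Wk_list, Wv_list)]
    return np.concatenate(heads, axis=-1) @ Wo
```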
In the multi-head implementation of the Transformer Summary assignment, the only benefit seems to be that the heads can be computed in parallel rather than as a single head, but that is not the original idea behind multi-head attention.
To check this, I calculated attention for some sample q, k, v, first over the entire embedding (i.e., without splitting into d_head = d_feature / n_heads) and then over several heads with subsequent concatenation. I find that the results are the same whether it is a single head or multiple heads (except for differences that might arise from the linear layer across multiple iterations of the decoder block).
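The check I ran was along the lines of the sketch below (a reconstruction with made-up shapes and no projection matrices, not my exact code): it compares attention over the full embedding with attention computed per head-slice and then concatenated, and prints the difference.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product attention over whatever feature dimension q, k, v carry.
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v

rng = np.random.default_rng(0)
seq_len, d_feature, n_heads = 4, 8, 2
d_head = d_feature // n_heads
q, k, v = (rng.normal(size=(seq_len, d_feature)) for _ in range(3))

# Attention over the entire embedding at once (a single head of size d_feature).
single = attention(q, k, v)

# Split the embedding into head-slices, attend within each slice, then concatenate.
multi = np.concatenate(
    [attention(q[:, h*d_head:(h+1)*d_head],
               k[:, h*d_head:(h+1)*d_head],
               v[:, h*d_head:(h+1)*d_head]) for h in range(n_heads)],
    axis=-1)

print("max abs difference:", np.abs(single - multi).max())
```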
Kindly clarify whether d_head should be d_feature / n_heads, or whether each d_head should be the same as the dimension of the word embedding of q / x.