Hello, I’m learning the Transformer architecture, and while doing a Coursera lab related to multi-head attention I don’t really understand the sizes of the query, key, and value. Here is my reasoning:
I’m using the TensorFlow framework, with vectors represented as row vectors. When I compute q = X · W_q, where X has shape (batch_size, sequence, vector_emb), q has shape (batch_size, sequence, dim_q). Similarly, k has shape (batch_size, sequence, dim_k) and v has shape (batch_size, sequence, dim_v).
However, in the assignment dim_q = dim_k = embed_dim at the beginning, while in TensorFlow’s built-in MultiHeadAttention layer dim_k = dim_v. Why?
If dim_q != dim_k, the matrix multiplication that computes the attention scores doesn’t work out.
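Here is a minimal sketch of what I mean (the sizes are made up and W_q, W_k, W_v are just random stand-ins for the learned projection weights, not the assignment’s actual code):

```python
import tensorflow as tf

# Illustrative sizes only.
batch_size, sequence, vector_emb = 2, 5, 16
dim_q = dim_k = 8    # these two must match for the score matmul below
dim_v = 12           # this one can differ

X = tf.random.normal((batch_size, sequence, vector_emb))
W_q = tf.random.normal((vector_emb, dim_q))   # stand-ins for learned projections
W_k = tf.random.normal((vector_emb, dim_k))
W_v = tf.random.normal((vector_emb, dim_v))

q = tf.einsum('bse,ed->bsd', X, W_q)   # (batch_size, sequence, dim_q)
k = tf.einsum('bse,ed->bsd', X, W_k)   # (batch_size, sequence, dim_k)
v = tf.einsum('bse,ed->bsd', X, W_v)   # (batch_size, sequence, dim_v)

# Attention scores: the last dimension of q must equal the last dimension of k.
scores = tf.matmul(q, k, transpose_b=True)   # (batch_size, sequence, sequence)
print(q.shape, k.shape, v.shape, scores.shape)
```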
I may be missing which formula you are asking about, and whether your question is about the lectures or specifically about one section of the assignment, but I think the relevant one is this, from the scaled_dot_product_attention section of the assignment:
Attention(Q, K, V) = softmax(Q · K^T / sqrt(d_k)) · V
We are doing this one sample at a time, so the first of the three dimensions (the batch dimension) vanishes and the remaining dimensions are:
Q is seq_len_q x depth
K is seq_len_k x depth
So the dimensions for the dot product Q · K^T work out, because Q and K share the same last dimension (depth), and you end up with seq_len_q x seq_len_k.
But then you need to take the dot product of that with V, which is seq_len_v x depth_v.
That is why we need seq_len_k == seq_len_v in order for that to work; the final result is seq_len_q x depth_v.
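To make the shape bookkeeping concrete, here is a rough sketch of that per-sample computation (made-up sizes, not the assignment’s exact code):

```python
import tensorflow as tf

# Made-up sizes: Q and K share depth; K and V share seq_len; depth_v is free.
seq_len_q, seq_len_k, seq_len_v = 4, 6, 6
depth, depth_v = 8, 10

Q = tf.random.normal((seq_len_q, depth))
K = tf.random.normal((seq_len_k, depth))
V = tf.random.normal((seq_len_v, depth_v))

dk = tf.cast(depth, tf.float32)
scores = tf.matmul(Q, K, transpose_b=True) / tf.sqrt(dk)   # (seq_len_q, seq_len_k)
weights = tf.nn.softmax(scores, axis=-1)                   # (seq_len_q, seq_len_k)
output = tf.matmul(weights, V)                             # needs seq_len_k == seq_len_v
print(output.shape)                                        # (seq_len_q, depth_v)
```

Note that depth_v does not have to equal depth; the output simply comes out as seq_len_q x depth_v.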
If you are asking about one of the lectures or a different part of the assignment, please let me know. If it’s a lecture, please show a screenshot or give the time offset.