As I understand Attention, the key calculation is the matrix multiplication (QK^T)V - not including the softmax or scaling factor.
Q is an LxD matrix, where L = # of query vectors and D = the dimension of each vector. Similarly, K is an MxD matrix (M = # of key vectors) and V is an MxD matrix (M = # of value vectors), both with the same dimension D. Each row of these matrices represents, respectively, a query, key, or value vector.
The matrix product QK^T is an LxM matrix in which each row contains the dot products of the query vector from the corresponding row of Q with every key vector in K. Multiplying this by V, giving (QK^T)V, we obtain an LxD matrix in which each row is a weighted sum of the row vectors of V. This gives us a set of vectors in V's space. For an NMT problem, these vectors are then decoded into the target language.
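For concreteness, here is a minimal NumPy sketch of those shapes (the sizes are arbitrary illustration values, and the softmax and scaling factor are omitted, as above):

```python
# Minimal shape check for (QK^T)V; illustrative sizes, softmax/scaling omitted.
import numpy as np

L, M, D = 4, 6, 8          # L queries, M keys/values, all of dimension D
Q = np.random.randn(L, D)  # one query vector per row
K = np.random.randn(M, D)  # one key vector per row
V = np.random.randn(M, D)  # one value vector per row

scores = Q @ K.T           # (L, M): row i holds the dot products of query i with every key
out = scores @ V           # (L, D): each row is a weighted sum of the rows of V

print(scores.shape, out.shape)  # (4, 6) (4, 8)
```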
So, when translating from English to German, I would expect the Query matrix to contain rows/vectors corresponding to English and the Key and Value matrices to contain rows/vectors corresponding to German.
But the input encoder, which is fed input tokens (i.e. English tokens), produces the Key & Value matrices, whereas the pre-attention decoder, which is fed target tokens (i.e. German tokens), produces the Query matrix.
Somewhere, I have a major misunderstanding here. What am I missing?
As you can see in def prepare_attention_input, the keys, values, and queries will all have shape (batch_size, padded_input_length, d_model). For a single example, the dimensions will thus be (padded_input_length, d_model).
So Q.K^T will have dimension (padded_input_length, padded_input_length).
(Q.K^T).V will have dimension (padded_input_length, d_model).
The output of the model provides a probability distribution over the German vocabulary (as you are translating to German), from which each output token is chosen.
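To make those shapes and the final step concrete, here is a rough NumPy sketch with made-up sizes. This is not the course's prepare_attention_input or its output layer; the projection W_out and the vocabulary size are just illustrative assumptions:

```python
# Rough shape check for batched attention plus a hypothetical vocabulary projection.
import numpy as np

batch_size, padded_input_length, d_model, vocab_size_de = 2, 10, 16, 32000  # assumed sizes

Q = np.random.randn(batch_size, padded_input_length, d_model)
K = np.random.randn(batch_size, padded_input_length, d_model)
V = np.random.randn(batch_size, padded_input_length, d_model)

scores = Q @ np.swapaxes(K, -1, -2)   # (batch_size, padded_input_length, padded_input_length)
att = scores @ V                      # (batch_size, padded_input_length, d_model)

# Hypothetical final step: project onto the German vocabulary and normalise,
# so each position carries a probability distribution over German tokens.
W_out = np.random.randn(d_model, vocab_size_de)      # learned in practice; random here
logits = att @ W_out                                 # (batch_size, padded_input_length, vocab_size_de)
probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
probs /= probs.sum(axis=-1, keepdims=True)           # softmax over the German vocabulary

print(scores.shape, att.shape, probs.shape)
```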
Maybe it helps to see queries, keys, and values not so much as tied to the particular languages being used, but as expressing features of meaning, where the meaning features implicit in the input language are matched to meaning features in the output language. During training, the queries give indications of the meaning features implicit in the target. These are matched with the keys in order to select the most fitting meaning features in the values, and the parameters of the model are updated through backpropagation. In other words, the model is calibrated to match meaning features from one language to the other.
Just making certain I understand the matrix dimensions:
batch_size just reflects the number of translations one wants to do at once.
padded_input_length = # of queries (or keys or values) + however many padding (dummy) rows are needed to bring every sequence in the batch to a common length.
In general, although the key and value matrices must have the same input length, the query matrix may have a different length.
The key and value matrices must always have the same dimensions. They generally contain the same values, though not necessarily; since each key vector corresponds to a value vector, I can't think of a case where one would want them to differ.
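To check my understanding of the dimensions, a small sketch (illustrative sizes; W_Q, W_K, W_V stand in for the learned projections used in practice, and are just random here):

```python
# Dimension check: queries from the decoder side, keys/values from the encoder side.
import numpy as np

L, M, d_model = 5, 9, 16                    # L decoder positions, M encoder positions
decoder_act = np.random.randn(L, d_model)   # source of the queries
encoder_act = np.random.randn(M, d_model)   # source of both the keys and the values

W_Q = np.random.randn(d_model, d_model)     # learned in practice; random here
W_K = np.random.randn(d_model, d_model)
W_V = np.random.randn(d_model, d_model)

Q = decoder_act @ W_Q                       # (L, d_model)
K = encoder_act @ W_K                       # (M, d_model)
V = encoder_act @ W_V                       # (M, d_model): same shape as K; with separate
                                            # projections the entries differ, otherwise K == V

out = (Q @ K.T) @ V                         # (L, d_model): the query length L can differ from M
print(out.shape)                            # (5, 16)
```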