I really don’t understand what happens at the attention layer. What I believe occurs is as follows:
In our programming assignment’s specific example, queries, keys & values matrices have dims [batch size (# of sentences in given batch) , padded_input_length (# tokens per sentence, including padding tokens) , d_model (embedding dimension)] - (i.e. dim = [2,3,4]).
Looking at the slide from 2:22 into the Queries, Keys Values & Attention video, what I think is happening is as follows:
The Q and K.T matrices are matmul’d together. I’m assuming that this occurs by looping over the batch dim. So the [0,:,:] Q & K.T matrices are matmul’d, as are the [1,:,:] Q & K.T matrices, resulting in a [2,3,3] matrix, since the original Q & K matrices had dimensions of [2,3,4].
Each element of the resulting matrix is then scaled by a factor of sqrt(d_k), where d_k = num tokens per sentence, including padding tokens.
The mask may (or may not) be applied to ensure that future words do not affect the prediction of past words.
The scaled dot product attention is then the application of a softmax function. Again, I’m used to softmax functions being applied to 1D vectors, not 3D matrices. I think each row of the QK.T matrices contains the projections of each Q vector onto a basis space of K vectors (i.e. the 1st row gives the projection of the 1st Q vector onto a basis space described by the K vectors). So, the softmax would be applied over the first axis of the full QK.T matrix (e.g. [0,:,0], [0,:,1], [0,:,2] … [1,:,2] - remember I believe QK.T is a [2,3,3] dimensional matrix). The purpose of the softmax in this case is to normalize the resulting vectors; it does not have a probabilistic interpretation.
This result is matmul’d with V, which is a [2,3,4] dim matrix. So, we again obtain a [2,3,4] dim. matrix which is formed in the same way I described QK.T being formed. The effect of this is to take each of the Q vectors, which have been expressed as combinations of K vectors and re-express them in the original embedding space.
A residual connection is used to add the original query matrix to the output of the attention layer. I don’t fully understand why it is being done. I’ve only seen this done in very deep nets, and then, it was done repeatedly, not just once. Is this to somehow ensure the attention layer only makes slight modifications to the query matrix? Or does it somehow refine attention in some other way?
Any confirmations or corrections would be extremely welcome. (Note: I tried looking at the Trax code, but only became more confused.)
Regarding your question about why a residual connection is used in attention, I think the explanation lies in the course 4 week 2 materials, where Transformers are introduced. I think a lot of things will become clear after week 2.
I’d like to add that you understand things pretty well. Here are my comments on your points:
I think you got it right (but no looping is done - just vectorized multiplications). The resulting matrix represents the similarities between each combination of tokens from the queries and keys.
d_k is the dimension of the keys (usually the embedding dimension size), not the length of the sequence with padding.
Yes.
Softmax is applied to get the Attention weights (they are later used to get the Scaled Dot Product Attention) - in other words, to map the similarities into the range 0 to 1, with the weights for each query summing to 1.
This is the result that is called Scaled Dot Product Attention.
You understand it right - this connection is for deep models (so that the original input is not lost track of as it passes through many layers).
In next week’s Lab (C4 W2 Lab Attention) there is a better explanation of the Attention layer.
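To make the steps above concrete, here is a minimal NumPy sketch of scaled dot-product attention using the [2,3,4] shapes from the question. It is not the course's Trax code; the function names and the random inputs are assumptions for illustration only. Two details are worth checking against it: the scores are divided by sqrt(d_k), and in the standard formulation the softmax runs over the last axis (the keys), so each query's row of weights sums to 1.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability; the result is unchanged.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(q, k, v, causal=False):
    # q, k, v: [batch, seq_len, d_model]
    d_k = k.shape[-1]  # key dimension (4 here), not the sequence length
    # Batched matmul: no explicit loop over the batch dimension.
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(d_k)  # [batch, len_q, len_k]
    if causal:
        len_q, len_k = scores.shape[-2:]
        # True above the diagonal marks "future" positions to hide.
        future = np.triu(np.ones((len_q, len_k), dtype=bool), k=1)
        scores = np.where(future, -1e9, scores)
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ v, weights         # [batch, len_q, d_model], [batch, len_q, len_k]

rng = np.random.default_rng(0)
q = rng.normal(size=(2, 3, 4))
k = rng.normal(size=(2, 3, 4))
v = rng.normal(size=(2, 3, 4))

out, weights = scaled_dot_product_attention(q, k, v, causal=True)
print(out.shape)      # (2, 3, 4)
print(weights.shape)  # (2, 3, 3)
```

With `causal=True`, the first query can only attend to the first key, so its row of weights is [1, 0, 0].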
@Jenna and @arvyzukai - Thanks for your detailed replies. I hope I’m not beating a dead horse here, but I’m still puzzled by the residual connection. This is not a particularly deep net, nor does it have several residual connections, as most resnets do. Also, the residual connection could have chosen to use the K or V matrices as the value to be forwarded. Why was the Q matrix chosen? I suspect it’s because the whole point of the Attention layer was just to slightly tune the Q matrix, but I could be wrong. Either way, this seems (to me) to be an unusual use of a residual connection and I’m wondering whether there was a motivating thought behind it, or whether it was just something that was tried and worked. However, again, thank you for your detailed answers to my questions.
Usually Transformer models are much deeper, but I think the course creators chose to include the residual connection so that learners can understand the architecture. It could also have helped with performance in this particular case, but I’m not sure. As an exercise, you could try to train without the residual connection, compare the results, and let us know.
Why was the Q matrix chosen? I suspect it’s because the whole point of the Attention layer was just to slightly tune the Q matrix
Yes, you are right. Loosely speaking, the whole purpose of Attention is to “tune the Q matrix” so that the model “would know which word needs translation”, and the residual connection is there “not to alter the original sentence too much” (e.g. “I have a car” would not become “I have a bicycle”).
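As a rough illustration of the “tune the Q matrix” idea, here is a small NumPy sketch of attention followed by a residual connection. The function names are hypothetical and this is not the course's Trax implementation. One practical point it shows: the attention output always has the same shape as Q (one output row per query), so Q is the input that can be added back directly; K and V could in general have a different sequence length.

```python
import numpy as np

def attention(q, k, v):
    # Plain scaled dot-product attention, no mask.
    d_k = k.shape[-1]
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(d_k)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

def attention_with_residual(q, k, v):
    # The attention output acts as an update added on top of the queries.
    return q + attention(q, k, v)

rng = np.random.default_rng(0)
q = rng.normal(size=(2, 3, 4))   # 3 query tokens per sentence
k = rng.normal(size=(2, 5, 4))   # 5 key/value tokens: a different length is fine
v = rng.normal(size=(2, 5, 4))

out = attention_with_residual(q, k, v)
print(out.shape)  # (2, 3, 4), same shape as q

# If attention contributes nothing (all-zero values), Q passes through unchanged.
unchanged = attention_with_residual(q, k, np.zeros_like(v))
print(np.allclose(unchanged, q))  # True
```

The last two lines show the “do not alter the original too much” intuition in its extreme form: when the attention update is zero, the layer returns the queries as they were.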