I initially thought the feature dimension (the third dimension) for both q and k must be the same, since they are multiplied using the dot product in the expression tf.matmul(q, k, transpose_b=True). Please refer to the scaled_dot_product_attention function in the assignment for more context.
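Just to illustrate the shape requirement I mean, here is a small standalone sketch (not the assignment code, just plain TensorFlow with made-up shapes) of why the last dimension of q and k has to match in tf.matmul(q, k, transpose_b=True):

    import tensorflow as tf

    batch, len_q, len_k, depth = 2, 4, 6, 8          # made-up sizes

    q = tf.random.normal((batch, len_q, depth))      # (batch, len_q, depth)
    k = tf.random.normal((batch, len_k, depth))      # (batch, len_k, depth)

    # transpose_b=True turns k into (batch, depth, len_k), so the inner
    # dimension (depth) must be the same for the matmul to be defined.
    scores = tf.matmul(q, k, transpose_b=True)       # (batch, len_q, len_k)
    print(scores.shape)                              # (2, 4, 6)

The sequence lengths of q and k can differ (4 vs 6 here); it is only the feature dimension that has to line up, otherwise the matmul raises an error.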
Another question: what is the role of embedding_dim in this context? I noticed that changing its value has no noticeable effect. Are the third dimensions of both q and k meant to be equal to embedding_dim? Thanks in advance!
I don't see where scaled_dot_product_attention appears within

mha = tf.keras.layers.MultiHeadAttention(
    num_heads=num_heads,
    key_dim=embedding_dim,
    dropout=0.1
)

This layer takes these parameters, but they are not the same as the ones scaled_dot_product_attention takes!
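For what it is worth, my reading of the Keras docs is that the scaled dot-product step happens inside the layer's call rather than in its constructor; the arguments above only configure the layer. A rough standalone sketch with made-up shapes (not assignment code):

    import tensorflow as tf

    num_heads, embedding_dim = 2, 16                 # made-up values
    mha = tf.keras.layers.MultiHeadAttention(
        num_heads=num_heads,
        key_dim=embedding_dim,
        dropout=0.1,
    )

    x = tf.random.normal((1, 5, embedding_dim))      # (batch, seq_len, embedding_dim)

    # The attention computation happens in this call; return_attention_scores=True
    # also returns the attention weights computed internally.
    output, scores = mha(query=x, value=x, key=x, return_attention_scores=True)
    print(output.shape)                              # (1, 5, 16)
    print(scores.shape)                              # (1, 2, 5, 5) = (batch, num_heads, len_q, len_k)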
The role of the embedding dimension is to control how much information you capture about each token from its context; a larger value increases it. And no, embedding_dim is not the same thing as q: it refers to the size of the vectors that represent your tokens. There are many factors at play here, which is why you may not see an immediate difference in the output just by changing the embedding dimension.
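To make the "size of your tokens" point concrete, here is a tiny standalone sketch (vocabulary size and dimensions are made up): an Embedding layer maps each token id to a vector of length embedding_dim, and that length becomes the third dimension that later feeds into q, k and v.

    import tensorflow as tf

    vocab_size, embedding_dim = 10000, 128           # made-up values
    embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)

    token_ids = tf.constant([[5, 42, 7, 0]])         # (batch=1, seq_len=4)
    vectors = embedding(token_ids)
    print(vectors.shape)                             # (1, 4, 128) = (batch, seq_len, embedding_dim)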
Posting any part of the graded assignment code is a violation of the Code of Conduct. Be careful when you post your query; you can always share the code by personal DM if a mentor asks for it.
For your first query, you are probably forgetting that the multiplication is between q and the transpose of k, not q and k directly.
And yes, the third dimension of q, k and v should be the same, since it is the embedding dimension.
If you look at the MHA code, key_dim is set to the embedding dimension, so when this MHA is used to decode the input, the embedding vectors will have the same size as the third dimension of q, k and v.
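A quick standalone check of that point, with toy shapes rather than the assignment's: in self-attention q, k and v are the same embedded sequence, so their third dimension is the embedding dimension, and key_dim is set to that same value.

    import tensorflow as tf

    embedding_dim, num_heads = 64, 4                 # toy values
    mha = tf.keras.layers.MultiHeadAttention(num_heads=num_heads, key_dim=embedding_dim)

    emb = tf.random.normal((2, 10, embedding_dim))   # embedded sequence: (batch, seq_len, embedding_dim)

    # Self-attention: q, k and v are all the same embedded sequence.
    out = mha(query=emb, value=emb, key=emb)
    print(out.shape)                                 # (2, 10, 64), last dim matches embedding_dim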
I apologize for that! I have just removed the code.
I believe that the key_dim is what the multi-head attention (MHA) expects the query (q) and key (k) to have in their third dimension. However, when we pass the q and k, their third dimensions are not the same. Despite this, it still works. Can you explain why?
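Here is a minimal standalone reproduction of what I mean, with made-up shapes rather than the assignment's. My understanding from the Keras documentation is that the layer adds its own dense projections down to key_dim per head, which seems to be why this runs even though the last dimensions differ:

    import tensorflow as tf

    mha = tf.keras.layers.MultiHeadAttention(num_heads=2, key_dim=8)

    q = tf.random.normal((1, 5, 12))                 # query: last dim 12
    kv = tf.random.normal((1, 7, 16))                # key/value: last dim 16, different from q

    # This runs without error: the layer's internal projections map both inputs
    # to key_dim per head before the scaled dot-product is applied.
    out = mha(query=q, value=kv, key=kv)
    print(out.shape)                                 # (1, 5, 12), follows the query's shape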
So basically, in the self-attention mechanism the multi-head attention makes sure q, k and v have the same dimensions, with the help of masking and positional encoding. The attention weights used in the decoder blocks then help the model's understanding: we set return_attention_scores to True so the attention output is returned together with the attention scores (remember that in scaled dot-product attention the output is the multiplication of the attention weights with the values).
The decoder layer is able to get the exact match because, as in the encoder layer, the masked MHA is used in the first step, where q, k and v are the same. A skip connection then adds your input to the output of that MHA call, and the result is passed through a normalisation layer.
This whole pattern is then repeated, with a dropout layer in place of the MHA, which makes sure that when the model is trained on a given input, the decoding runs from start to end.
Now notice that in the decoder layer the look-ahead mask uses 1s (allowed) rather than 0s (blocked), so each position can only attend to itself and the positions before it; the model has to predict the next token without peeking at it. In other words, it is set up to match the same mechanism as the encoder, with q, k and v being the same.
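To tie the masking and skip-connection points together, here is a generic standalone sketch of a decoder-style first sub-layer (toy shapes, not the graded assignment code): a look-ahead mask built with tf.linalg.band_part, masked self-attention, a skip connection, then layer normalisation. I am using the Keras attention_mask convention where 1 means "may attend" and 0 means "blocked"; some implementations use the opposite convention.

    import tensorflow as tf

    def look_ahead_mask(size):
        # Lower-triangular ones: position i may attend to positions <= i only.
        return tf.linalg.band_part(tf.ones((size, size)), -1, 0)

    embedding_dim, num_heads, seq_len = 64, 4, 10    # toy values
    mha = tf.keras.layers.MultiHeadAttention(num_heads=num_heads, key_dim=embedding_dim)
    dropout = tf.keras.layers.Dropout(0.1)
    layernorm = tf.keras.layers.LayerNormalization(epsilon=1e-6)

    x = tf.random.normal((2, seq_len, embedding_dim))         # decoder input embeddings

    mask = look_ahead_mask(seq_len)[tf.newaxis, ...]          # (1, seq_len, seq_len), broadcasts over the batch
    attn_out = mha(query=x, value=x, key=x, attention_mask=mask)
    attn_out = dropout(attn_out)                              # only active when training=True
    out1 = layernorm(x + attn_out)                            # skip connection, then layer norm
    print(out1.shape)                                         # (2, 10, 64)

Recent TensorFlow releases also let you pass use_causal_mask=True in the call so the layer builds this mask for you.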