In the programming exercise:
I was thinking the mask should have the same shape as the input “seq”. Why are 2 new dimensions needed? Also, as a result, the two softmax results calculated next have different shapes: [3, 5] vs. [3, 1, 3, 5]?
The mask is used in scaled_dot_product_attention and is added to the scaled (q × k-transpose) attention scores, so its shape has to be broadcastable to (batch_size, num_heads, seq_len, seq_len).
By the way, the softmax calculation example with and without the mask is just there to demonstrate how masking works; in MultiHeadAttention we won’t add the mask to the input sequence x directly.
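Here is a minimal sketch of that broadcasting, assuming the tutorial’s padding-mask convention (1 marks a padded position, two singleton axes inserted); the variable names and random q/k are just illustrative:

```python
import tensorflow as tf

batch_size, num_heads, seq_len, depth = 3, 2, 5, 4

# Toy input with zero padding at the end of each sequence, shape (batch_size, seq_len).
seq = tf.constant([[7., 6., 0., 0., 1.],
                   [1., 2., 3., 0., 0.],
                   [0., 0., 0., 4., 5.]])

# Padding mask: 1 where the token is padding, with two singleton axes so it
# broadcasts over the head and query-length dimensions.
mask = tf.cast(tf.math.equal(seq, 0), tf.float32)  # (batch_size, seq_len)
mask = mask[:, tf.newaxis, tf.newaxis, :]          # (batch_size, 1, 1, seq_len)

# Attention scores q·kᵀ / sqrt(depth), one score per (query, key) pair, per head.
q = tf.random.normal((batch_size, num_heads, seq_len, depth))
k = tf.random.normal((batch_size, num_heads, seq_len, depth))
scores = tf.matmul(q, k, transpose_b=True) / tf.math.sqrt(float(depth))
# scores: (batch_size, num_heads, seq_len, seq_len)

# Adding the mask broadcasts (batch_size, 1, 1, seq_len) across all heads and all
# query positions, pushing padded key positions toward -inf before the softmax.
weights = tf.nn.softmax(scores + mask * -1e9, axis=-1)
print(weights.shape)  # (3, 2, 5, 5)
```

The same broadcasting explains the shapes in the demo: softmax over x alone keeps shape [3, 5], while x + mask * -1e9 broadcasts against the (3, 1, 1, 5) mask and produces [3, 1, 3, 5].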