As far as I understand, create_padding_mask(x) is supposed to take an (n, m) matrix as input and return an (n, 1, m) binary tensor as output.
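For reference, this is roughly what I understand the implementation to look like (my own paraphrase of the notebook's function, so details may differ):

```python
import tensorflow as tf

def create_padding_mask(x):
    # 1 for real (non-zero) tokens, 0 for padding (zero) tokens
    seq = 1 - tf.cast(tf.math.equal(x, 0), tf.float32)
    # add a middle axis so the mask can broadcast over attention rows:
    # (n, m) -> (n, 1, m)
    return seq[:, tf.newaxis, :]
```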
Yet when I use it in my code for Exercise 3, scaled_dot_product_attention, that is not what I observe. Specifically, with an input of size (3, 4) I end up with an output of size (3, 3, 4).
I have the following code:
```python
print(scaled_attention_logits.shape)
scaled_attention = scaled_attention_logits + (1 - create_padding_mask(scaled_attention_logits)) * -1.0e9
print(scaled_attention.shape)
```
scaled_attention_logits has shape (3, 4), as expected.
scaled_attention, however, comes out with shape (3, 3, 4). Can you please tell me why?
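Here is a hypothetical standalone snippet that reproduces the shapes I am seeing, assuming create_padding_mask works as in my sketch above (the random input values don't matter here, only the shapes do):

```python
import tensorflow as tf

def create_padding_mask(x):
    # same sketch as above: (n, m) -> (n, 1, m)
    seq = 1 - tf.cast(tf.math.equal(x, 0), tf.float32)
    return seq[:, tf.newaxis, :]

scaled_attention_logits = tf.random.uniform((3, 4))
print(scaled_attention_logits.shape)  # (3, 4)

# mask has shape (3, 1, 4), so adding it to the (3, 4) logits
# produces a (3, 3, 4) result
scaled_attention = scaled_attention_logits + \
    (1 - create_padding_mask(scaled_attention_logits)) * -1.0e9
print(scaled_attention.shape)  # (3, 3, 4)
```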