Hi,
I can't get my head around the following:
In the exercise we create a padding mask that ends up with the shape (b, 1, 1, s) (with b = batch size and s = max sequence length). Since we later feed it into the MultiHeadAttention layer from Keras, I had a look at the documentation to be sure I pass the mask in properly. There I find: "attention_mask: a boolean mask of shape [B, T, S], that prevents attention to certain positions." (where B = batch size, T = sequence length of query, S = sequence length of value).
Therefore, I would imagine the mask for self-attention to have the shape (b, s, s). Furthermore, as far as I understand, a boolean mask would hold values of True and False instead of 1 and 0.
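To make the shapes concrete, here is a minimal sketch of how I currently picture it (the token ids, the padding id 0, and all variable names are just illustrative assumptions, not from the exercise):

```python
import tensorflow as tf

# Toy setup, just to illustrate the shapes (b = batch size, s = max seq length).
b, s, d = 2, 5, 16
x = tf.random.uniform((b, s, d))          # dummy embeddings
ids = tf.constant([[7, 3, 9, 0, 0],
                   [4, 2, 0, 0, 0]])      # dummy token ids, 0 = padding

pad = ids != 0                            # (b, s), True where the token is real

# The mask as built in the exercise: shape (b, 1, 1, s), values 1/0
mask_exercise = tf.cast(pad, tf.float32)[:, tf.newaxis, tf.newaxis, :]

# What the Keras docs seem to ask for: boolean, shape (B, T, S) = (b, s, s)
mask_keras = tf.repeat(pad[:, tf.newaxis, :], s, axis=1)   # (b, s, s)

mha = tf.keras.layers.MultiHeadAttention(num_heads=2, key_dim=8)
out = mha(query=x, value=x, attention_mask=mask_keras)     # self-attention
print(out.shape)                          # (2, 5, 16)
```

Is converting the (b, 1, 1, s) mask into a boolean (b, s, s) mask like this the intended way, or does the layer handle the exercise's mask directly?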
I would be very happy if someone knows how to feed the padding mask into the Keras layer and could clear up my confusion.