Hi,
I can't get my head around the following:
In the exercise we create a padding mask that ends up with the shape (b, 1, 1, s) (with b = batch size and s = max sequence length). Since we later feed it into the MultiHeadAttention layer from Keras, I had a look at the documentation to be sure I pass the mask in properly. There I find: "attention_mask: a boolean mask of shape [B, T, S], that prevents attention to certain positions." (where B = batch size, T = sequence length of query, S = sequence length of value).
Therefore, I would imagine the mask for self-attention to have the shape (b, s, s). Furthermore, as far as I understand, a boolean mask would hold values of True and False instead of 1 and 0.
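To make the shapes concrete, here is a minimal sketch of how I currently picture it (the token ids, the padding id 0, and all variable names are just illustrative assumptions, not from the exercise):

```python
import tensorflow as tf

# Toy setup, just to illustrate the shapes (b = batch size, s = max seq length).
b, s, d = 2, 5, 16
x = tf.random.uniform((b, s, d))          # dummy embeddings
ids = tf.constant([[7, 3, 9, 0, 0],
                   [4, 2, 0, 0, 0]])      # dummy token ids, 0 = padding

pad = ids != 0                            # (b, s), True where the token is real

# The mask as built in the exercise: shape (b, 1, 1, s), values 1/0
mask_exercise = tf.cast(pad, tf.float32)[:, tf.newaxis, tf.newaxis, :]

# What the Keras docs seem to ask for: boolean, shape (B, T, S) = (b, s, s)
mask_keras = tf.repeat(pad[:, tf.newaxis, :], s, axis=1)   # (b, s, s)

mha = tf.keras.layers.MultiHeadAttention(num_heads=2, key_dim=8)
out = mha(query=x, value=x, attention_mask=mask_keras)     # self-attention
print(out.shape)                          # (2, 5, 16)
```

Is converting the (b, 1, 1, s) mask into a boolean (b, s, s) mask like this the intended way, or does the layer handle the exercise's mask directly?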
I would be very happy if someone knows how to feed the padding mask into the Keras layer and could clear up my confusion.