Hi,
The mask function in the exercise is a little different from the MultiHeadAttention implementation in the TensorFlow core API. Although the API documentation says the shape of attention_mask is [B, T, S], it will eventually be expanded to [B, H, T, S], where H is the number of heads. So both shapes are acceptable. Moreover, because of shape broadcasting, you can also use [B, 1, S] or [B, 1, 1, S].
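As a quick sanity check on the broadcasting claim, here is a minimal sketch (all shapes and values are just illustrative) showing a [B, 1, 1, S] mask applied to [B, H, T, S] attention scores:

```python
import tensorflow as tf

# Illustrative shapes: B=2 sequences, H=3 heads, T=S=4 positions.
B, H, T, S = 2, 3, 4, 4
scores = tf.random.uniform((B, H, T, S))   # raw attention scores
mask = tf.ones((B, 1, 1, S))               # [B, 1, 1, S] key mask

# Broadcasting stretches the singleton axes across heads and query
# positions, so the masked scores keep the shape [B, H, T, S].
masked = scores + (1.0 - mask) * -1e9
print(masked.shape)                        # (2, 3, 4, 4)
```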
However, there is one big difference between our exercise and the core API implementation: the two masks are complements of each other. In other words, a value of 1 (True) masks out padding in the exercise, whereas in the core API a value of 0 (False) masks out padding.
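Concretely, for a zero-padded batch the two conventions look like this (a minimal sketch, with a made-up batch of token IDs):

```python
import tensorflow as tf

# A hypothetical batch of zero-padded token IDs, shape [B, S].
seq = tf.constant([[7, 6, 0, 0], [1, 2, 3, 0]])

# Exercise convention: 1 marks a padding position to be masked out.
exercise_mask = tf.cast(tf.math.equal(seq, 0), tf.float32)

# Core-API convention: 1 (True) marks a position that MAY be attended
# to, i.e., the complement of the exercise mask.
keras_mask = 1.0 - exercise_mask
```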
Here is an example of how to create masks for MultiHeadAttention and use them in your Transformer programming assignment.
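The sketch below is one way to do it, not the official solution; the function names mirror the assignment's create_padding_mask / create_look_ahead_mask helpers, but rewritten in the core-API convention (True = may attend), and all shapes are illustrative:

```python
import tensorflow as tf

def create_padding_mask(seq):
    # True where the token is NOT padding, i.e., may be attended to
    # (core-API convention). Shape [B, 1, S] so it broadcasts over
    # heads and query positions inside MultiHeadAttention.
    return tf.math.not_equal(seq, 0)[:, tf.newaxis, :]

def create_look_ahead_mask(size):
    # Lower-triangular causal mask: True on and below the diagonal.
    ones = tf.ones((size, size))
    return tf.cast(tf.linalg.band_part(ones, -1, 0), tf.bool)

# Toy usage:
tokens = tf.constant([[7, 6, 0, 0], [1, 2, 3, 0]])        # [B, S], zero-padded
B, T, D, H = 2, 4, 8, 2
x = tf.random.uniform((B, T, D))                          # embedded inputs

pad_mask = create_padding_mask(tokens)                    # [B, 1, S]
causal_mask = create_look_ahead_mask(T)[tf.newaxis, :, :] # [1, T, S]
combined = tf.logical_and(pad_mask, causal_mask)          # [B, T, S], decoder self-attention

mha = tf.keras.layers.MultiHeadAttention(num_heads=H, key_dim=D // H)
out = mha(query=x, value=x, key=x, attention_mask=combined)
print(out.shape)                                          # (2, 4, 8)
```

If you would rather keep the exercise's convention in your own code, just remember to take the complement (e.g., tf.logical_not or 1 - mask) before passing the mask to the core-API layer.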