In the programming exercise:
I was thinking the mask should have the same shape as the input “seq”. Why are 2 new dimensions needed? Also, as a result, the two softmax results calculated next have different shapes: [3, 5] vs. [3, 1, 3, 5]?
The mask is used in scaled_dot_product_attention and is added to the scaled (q × k-transpose) attention scores, so its shape has to be broadcastable to (batch_size, num_heads, seq_len, seq_len).
By the way, the softmax calculation example with and without the mask is just there to demonstrate how masking works; in MultiHeadAttention we won’t add the mask to the input sequence x directly.
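Here is a minimal sketch of that broadcasting, assuming the tutorial’s padding-mask convention (1 marks a padded position, two singleton axes inserted); the variable names and random q/k are just illustrative:

```python
import tensorflow as tf

batch_size, num_heads, seq_len, depth = 3, 2, 5, 4

# Toy input with zero padding at the end of each sequence, shape (batch_size, seq_len).
seq = tf.constant([[7., 6., 0., 0., 1.],
                   [1., 2., 3., 0., 0.],
                   [0., 0., 0., 4., 5.]])

# Padding mask: 1 where the token is padding, with two singleton axes so it
# broadcasts over the head and query-length dimensions.
mask = tf.cast(tf.math.equal(seq, 0), tf.float32)  # (batch_size, seq_len)
mask = mask[:, tf.newaxis, tf.newaxis, :]          # (batch_size, 1, 1, seq_len)

# Attention scores q·kᵀ / sqrt(depth), one score per (query, key) pair, per head.
q = tf.random.normal((batch_size, num_heads, seq_len, depth))
k = tf.random.normal((batch_size, num_heads, seq_len, depth))
scores = tf.matmul(q, k, transpose_b=True) / tf.math.sqrt(float(depth))
# scores: (batch_size, num_heads, seq_len, seq_len)

# Adding the mask broadcasts (batch_size, 1, 1, seq_len) across all heads and all
# query positions, pushing padded key positions toward -inf before the softmax.
weights = tf.nn.softmax(scores + mask * -1e9, axis=-1)
print(weights.shape)  # (3, 2, 5, 5)
```

The same broadcasting explains the shapes in the demo: softmax over x alone keeps shape [3, 5], while x + mask * -1e9 broadcasts against the (3, 1, 1, 5) mask and produces [3, 1, 3, 5].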