In the video titled "NMT with Attention" we quickly go over the entire NMT with attention model. Younes mentions that the inputs to the attention layer are Q, K, V, and a padding mask. The padding mask is supposed to be a function of copies of the inputs and targets.
As I understand it (see the sketch below):
K.shape = [batch_size, padded_input_length, encoder_hidden_state_dimension]
Q.shape = [batch_size, 1, decoder_hidden_state_dimension]
V = K, so V.shape = K.shape
encoder_hidden_state_dimension = decoder_hidden_state_dimension
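To make my question concrete, here is a minimal NumPy sketch of what I *think* the padding mask does in dot-product attention: it marks which key positions are real tokens versus padding, and the scores at padded positions are pushed to a large negative value before the softmax so they get roughly zero attention weight. The function names and pad_token_id = 0 are my own assumptions, not necessarily what the assignment actually uses:

```python
import numpy as np

def padding_mask(input_ids, pad_token_id=0):
    # True where the input token is real, False at padding positions.
    # Shape [batch_size, 1, padded_input_length] so it broadcasts over query positions.
    return (input_ids != pad_token_id)[:, None, :]

def dot_product_attention(Q, K, V, mask):
    # Q: [batch, q_len, d], K/V: [batch, k_len, d], mask: [batch, 1, k_len]
    d = Q.shape[-1]
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d)        # [batch, q_len, k_len]
    scores = np.where(mask, scores, -1e9)                 # padded keys get ~ -inf
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights = weights / weights.sum(-1, keepdims=True)    # softmax: ~0 weight on padding
    return weights @ V                                     # [batch, q_len, d]

# Toy example: batch of 1, input length 4 with the last position padded, d = 2
ids = np.array([[5, 7, 9, 0]])        # 0 assumed to be the pad token
K = V = np.random.randn(1, 4, 2)
Q = np.random.randn(1, 1, 2)          # one decoder step, matching the shapes above
out = dot_product_attention(Q, K, V, padding_mask(ids))
print(out.shape)                      # (1, 1, 2)
```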
What exactly are we masking and later unmasking?
Could somebody provide more information on this part? It's the first time this is mentioned, and it seems quite important given that it was added to the slides.