Video: NMT with Attention

In the video titled "NMT with Attention" we quickly go over the entire NMT-with-attention model. Younes mentions that the inputs to the attention layer are Q, K, V, and a padding mask, and that the padding mask is computed as a function of copies of the inputs and targets.

As I understand:

K.shape = [batch_size, padded_input_length, encoder_hidden_state_dimension]
Q.shape = [batch_size, 1, decoder_hidden_state_dimension]
V = K, so V.shape = K.shape
encoder_hidden_state_dimension = decoder_hidden_state_dimension
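To make those shapes concrete, here is a minimal numpy sketch of how the tensors and the padding mask could be set up. The token ids, `pad_id`, and dimensions are made up for illustration; they are not from the course code:

```python
import numpy as np

batch_size, padded_input_length, d_model = 2, 5, 4
pad_id = 0  # assumed padding token id (hypothetical)

# hypothetical batch of input token ids, padded to length 5
input_ids = np.array([[7, 3, 9, 0, 0],
                      [4, 8, 0, 0, 0]])

# 1.0 for real tokens, 0.0 for padding -> shape [batch_size, padded_input_length]
mask = (input_ids != pad_id).astype(np.float32)

K = np.random.randn(batch_size, padded_input_length, d_model)  # encoder hidden states
V = K                                                          # V is a copy of K
Q = np.random.randn(batch_size, 1, d_model)                    # current decoder state
```

The mask has one entry per input position, so it lines up with the key/value axis of length `padded_input_length`.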

What exactly are we masking, and what is later unmasked?

Could somebody provide more information on this part? It’s the first time this is mentioned and seems quite important if you added it to the slides :slight_smile:

Hi Mazatov,

As a belated reply: the padding mask marks the padded positions of the input sequence. During attention, the scores for those positions are set to a large negative value before the softmax, so the padded elements receive essentially zero attention weight and are effectively excluded from the model's calculations.
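To illustrate, here is a small numpy sketch of dot-product attention with a padding mask. This is a generic illustration under the shape assumptions above, not the course's actual implementation; `masked_attention` and the `-1e9` fill value are my own choices:

```python
import numpy as np

def masked_attention(q, k, v, mask):
    # q: [batch, 1, d], k and v: [batch, seq_len, d]
    # mask: [batch, seq_len], 1.0 for real tokens, 0.0 for padding
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(k.shape[-1])  # [batch, 1, seq_len]
    # replace scores at padded key positions with a large negative number
    scores = np.where(mask[:, None, :] == 1, scores, -1e9)
    # softmax over the key axis; padded positions get ~zero weight
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights                               # context: [batch, 1, d]

q = np.random.randn(2, 1, 4)
k = np.random.randn(2, 5, 4)
v = k
mask = np.array([[1, 1, 1, 0, 0],
                 [1, 1, 0, 0, 0]], dtype=np.float32)
context, weights = masked_attention(q, k, v, mask)
```

After the softmax, the weights at padded positions are effectively zero, so the context vector is a weighted sum over real tokens only. That is all the masking does; nothing is literally removed and re-added later.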