In the video titled "NMT with Attention" we quickly go over the entire NMT with attention model. Younes mentions that the inputs to the attention layer are Q, K, V, and a padding mask. The padding mask is supposed to be a function of copies of the inputs and targets.
As I understand it (see the sketch below):
K.shape = [batch_size, padded_input_length, encoder_hidden_state_dimension]
Q.shape = [batch_size, 1, decoder_hidden_state_dimension]
V = K, so V.shape = K.shape
encoder_hidden_state_dimension = decoder_hidden_state_dimension
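To make my question concrete, here is a minimal NumPy sketch of what I *think* the padding mask does in dot-product attention: it marks which key positions are real tokens versus padding, and the scores at padded positions are pushed to a large negative value before the softmax so they get roughly zero attention weight. The function names and pad_token_id = 0 are my own assumptions, not necessarily what the assignment actually uses:

```python
import numpy as np

def padding_mask(input_ids, pad_token_id=0):
    # True where the input token is real, False at padding positions.
    # Shape [batch_size, 1, padded_input_length] so it broadcasts over query positions.
    return (input_ids != pad_token_id)[:, None, :]

def dot_product_attention(Q, K, V, mask):
    # Q: [batch, q_len, d], K/V: [batch, k_len, d], mask: [batch, 1, k_len]
    d = Q.shape[-1]
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d)        # [batch, q_len, k_len]
    scores = np.where(mask, scores, -1e9)                 # padded keys get ~ -inf
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights = weights / weights.sum(-1, keepdims=True)    # softmax: ~0 weight on padding
    return weights @ V                                     # [batch, q_len, d]

# Toy example: batch of 1, input length 4 with the last position padded, d = 2
ids = np.array([[5, 7, 9, 0]])        # 0 assumed to be the pad token
K = V = np.random.randn(1, 4, 2)
Q = np.random.randn(1, 1, 2)          # one decoder step, matching the shapes above
out = dot_product_attention(Q, K, V, padding_mask(ids))
print(out.shape)                      # (1, 1, 2)
```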
What exactly are we masking and later unmasking?
Could somebody provide more information on this part? It's the first time this is mentioned, and it seems quite important given that it was added to the slides.