The purpose of the Mask

In the Week 1 assignment (NMT using Attention), why do we need the mask? Where is it used precisely?

Hi @Mohammad_Atif_Khan, I can share some information about masks in transformer models, as described in the "Attention Is All You Need" paper.

In these models, masks are used to prevent the model from attending to certain parts of the input sequence when processing a given element of the sequence. For example, in natural language processing, the model might be processing a sentence and trying to predict the next word in the sequence. The mask can be used to prevent the model from attending to future words in the sentence, as those words have not yet been seen and should not be used to make the prediction.
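For illustration, here is a minimal sketch of such a look-ahead (causal) mask in plain NumPy (not the assignment's code; the names and sizes are made up): each position is only allowed to attend to itself and earlier positions, and blocked positions get a large negative score so they vanish after the softmax.

import numpy as np

seq_len = 5

# Look-ahead (causal) mask: position i may attend only to positions j <= i.
# True means "allowed to attend", False means "masked out".
causal_mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))

# Dummy attention scores; masked positions are pushed to a large negative
# value so their softmax weight becomes ~0.
scores = np.random.randn(seq_len, seq_len)
masked_scores = np.where(causal_mask, scores, -1e9)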

Masks can also be used to handle variable-length input sequences and to ensure that the model only attends to the relevant parts of the input. This is particularly useful in tasks such as machine translation, where the input and output sequences may have different lengths.
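And a similar sketch for a padding mask over a batch of variable-length sequences (again plain NumPy with made-up token ids, assuming 0 is the padding id as in the assignment):

import numpy as np

# Two sentences padded to length 6 with id 0.
tokens = np.array([[12, 7, 45, 3, 0, 0],
                   [ 5, 9,  0, 0, 0, 0]])

# True where there is a real token, False where there is padding.
padding_mask = tokens > 0

# Reshaped to [batch, 1, 1, seq_len], it broadcasts over attention heads
# and query positions, much like the mask built in the assignment.
attention_mask = padding_mask[:, None, None, :]
print(attention_mask.shape)   # (2, 1, 1, 6)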

So in general, masks allow the model to focus on the relevant parts of the input and avoid using invalid information when making predictions.

Again, this is general information about masks that will hopefully shed some light on your question.

Juan


Thanks @Juan_Olano.

Can you also please tell me why we are using the AddLossWeights() function in cell #9 of the NMT assignment, under the Bucketing section? I've pasted the cell here for reference (see the last two lines):

# Bucketing to create streams of batches.

# Buckets are defined in terms of boundaries and batch sizes.
# Batch_sizes[i] determines the batch size for items with length < boundaries[i]
# So below, we'll take a batch of 256 sentences of length < 8, 128 if length is
# between 8 and 16, and so on -- and only 2 if length is over 512.
boundaries =  [8,   16,  32, 64, 128, 256, 512]
batch_sizes = [256, 128, 64, 32, 16,    8,   4,  2]

# Create the generators.
train_batch_stream = trax.data.BucketByLength(
    boundaries, batch_sizes,
    length_keys=[0, 1]  # As before: count inputs and targets to length.
)(filtered_train_stream)

eval_batch_stream = trax.data.BucketByLength(
    boundaries, batch_sizes,
    length_keys=[0, 1]  # As before: count inputs and targets to length.
)(filtered_eval_stream)

# Add masking for the padding (0s).
train_batch_stream = trax.data.AddLossWeights(id_to_mask=0)(train_batch_stream)
eval_batch_stream = trax.data.AddLossWeights(id_to_mask=0)(eval_batch_stream)

This is in addition to the mask being created in the network:

# UNQ_C3
# GRADED FUNCTION
def prepare_attention_input(encoder_activations, decoder_activations, inputs):
    """Prepare queries, keys, values and mask for attention.
    
    Args:
        encoder_activations fastnp.array(batch_size, padded_input_length, d_model): output from the input encoder
        decoder_activations fastnp.array(batch_size, padded_input_length, d_model): output from the pre-attention decoder
        inputs fastnp.array(batch_size, padded_input_length): input tokens
    
    Returns:
        queries, keys, values and mask for attention.
    """
    
    ### START CODE HERE ###
    
    # set the keys and values to the encoder activations
    keys = encoder_activations
    values = encoder_activations

    
    # set the queries to the decoder activations
    queries = decoder_activations
    
    # generate the mask to distinguish real tokens from padding
    # hint: inputs is positive for real tokens and 0 where they are padding
    mask = inputs > 0
    
    ### END CODE HERE ###
    
    # add axes to the mask for attention heads and decoder length.
    mask = fastnp.reshape(mask, (mask.shape[0], 1, 1, mask.shape[1]))
    
    # broadcast so mask shape is [batch size, attention heads, decoder-len, encoder-len].
    # note: for this assignment, attention heads is set to 1.
    mask = mask + fastnp.zeros((1, 1, decoder_activations.shape[1], 1))
        
    
    return queries, keys, values, mask
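For reference, a quick shape check of the masking steps above, using plain NumPy in place of fastnp and made-up sizes:

import numpy as np

batch_size, padded_len, d_model = 2, 6, 4
decoder_activations = np.zeros((batch_size, padded_len, d_model))
inputs = np.array([[12, 7, 45, 3, 0, 0],
                   [ 5, 9,  0, 0, 0, 0]])

mask = inputs > 0
mask = np.reshape(mask, (mask.shape[0], 1, 1, mask.shape[1]))
mask = mask + np.zeros((1, 1, decoder_activations.shape[1], 1))
print(mask.shape)   # (2, 1, 6, 6): [batch, heads, decoder-len, encoder-len]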

Hi @Mohammad_Atif_Khan

In this particular case, these lines tell Trax not to care about the "padding" (the second part of @Juan_Olano's answer).

This way we do not reward or penalize the model's weights based on whether it predicts the padding tokens correctly. A loss weight of 0 for padding tokens results in a loss of 0 for those tokens (the loss is multiplied by 0), while every other token gets a weight of 1 (whatever the loss is for those tokens, it is multiplied by 1).
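A small sketch of the idea in plain NumPy (the per-token losses are dummy values; in practice AddLossWeights adds the weights array to each batch, and Trax's loss layer applies it):

import numpy as np

targets = np.array([[14, 27, 9, 1, 0, 0]])                    # 0 = padding id
per_token_loss = np.array([[0.3, 0.7, 0.2, 0.1, 0.5, 0.4]])   # dummy values

# AddLossWeights(id_to_mask=0) adds a weights array like this to each batch:
weights = (targets != 0).astype(np.float32)   # 1 for real tokens, 0 for padding

# Only the real tokens contribute to the training loss.
masked_loss = np.sum(per_token_loss * weights) / np.sum(weights)
print(masked_loss)   # averages 0.3, 0.7, 0.2, 0.1 only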