The purpose of the Mask

In the Week 1 assignment (NMT using Attention), why do we need the mask? Where is it used precisely?

Hi @Mohammad_Atif_Khan, I can share some information about masks in transformer models, as described in the "Attention Is All You Need" paper.

In these models, masks are used to prevent the model from attending to certain parts of the input sequence when processing a given element of the sequence. For example, in natural language processing, the model might be processing a sentence and trying to predict the next word in the sequence. The mask can be used to prevent the model from attending to future words in the sentence, as those words have not yet been seen and should not be used to make the prediction.
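For illustration, here is a minimal sketch of such a look-ahead (causal) mask in plain NumPy (not the assignment's code; the names and sizes are made up): each position is only allowed to attend to itself and earlier positions, and blocked positions get a large negative score so they vanish after the softmax.

import numpy as np

seq_len = 5

# Look-ahead (causal) mask: position i may attend only to positions j <= i.
# True means "allowed to attend", False means "masked out".
causal_mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))

# Dummy attention scores; masked positions are pushed to a large negative
# value so their softmax weight becomes ~0.
scores = np.random.randn(seq_len, seq_len)
masked_scores = np.where(causal_mask, scores, -1e9)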

Masks can also be used to handle variable-length input sequences and to ensure that the model only attends to the relevant parts of the input. This is particularly useful in tasks such as machine translation, where the input and output sequences may have different lengths.
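And a similar sketch for a padding mask over a batch of variable-length sequences (again plain NumPy with made-up token ids, assuming 0 is the padding id as in the assignment):

import numpy as np

# Two sentences padded to length 6 with id 0.
tokens = np.array([[12, 7, 45, 3, 0, 0],
                   [ 5, 9,  0, 0, 0, 0]])

# True where there is a real token, False where there is padding.
padding_mask = tokens > 0

# Reshaped to [batch, 1, 1, seq_len], it broadcasts over attention heads
# and query positions, much like the mask built in the assignment.
attention_mask = padding_mask[:, None, None, :]
print(attention_mask.shape)   # (2, 1, 1, 6)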

So in general, masks allow the model to focus on the relevant parts of the input and avoid using invalid information when making predictions.

Again, this is general information about masks that will hopefully shed some light on your question.

Juan


Thanks @Juan_Olano.

Can you also please tell me why we are using the AddLossWeights() function in cell #9 of the NMT assignment, under the Bucketing section? I've pasted the cell here for reference (see the last two lines):

# Bucketing to create streams of batches.

# Buckets are defined in terms of boundaries and batch sizes.
# Batch_sizes[i] determines the batch size for items with length < boundaries[i]
# So below, we'll take a batch of 256 sentences of length < 8, 128 if length is
# between 8 and 16, and so on -- and only 2 if length is over 512.
boundaries =  [8,   16,  32, 64, 128, 256, 512]
batch_sizes = [256, 128, 64, 32, 16,    8,   4,  2]

# Create the generators.
train_batch_stream = trax.data.BucketByLength(
    boundaries, batch_sizes,
    length_keys=[0, 1]  # As before: count inputs and targets to length.
)(filtered_train_stream)

eval_batch_stream = trax.data.BucketByLength(
    boundaries, batch_sizes,
    length_keys=[0, 1]  # As before: count inputs and targets to length.
)(filtered_eval_stream)

# Add masking for the padding (0s).
train_batch_stream = trax.data.AddLossWeights(id_to_mask=0)(train_batch_stream)
eval_batch_stream = trax.data.AddLossWeights(id_to_mask=0)(eval_batch_stream)

This is in addition to the mask being created in the network:

# UNQ_C3
# GRADED FUNCTION
def prepare_attention_input(encoder_activations, decoder_activations, inputs):
    """Prepare queries, keys, values and mask for attention.
    
    Args:
        encoder_activations fastnp.array(batch_size, padded_input_length, d_model): output from the input encoder
        decoder_activations fastnp.array(batch_size, padded_input_length, d_model): output from the pre-attention decoder
        inputs fastnp.array(batch_size, padded_input_length): input tokens
    
    Returns:
        queries, keys, values and mask for attention.
    """
    
    ### START CODE HERE ###
    
    # set the keys and values to the encoder activations
    keys = encoder_activations
    values = encoder_activations

    
    # set the queries to the decoder activations
    queries = decoder_activations
    
    # generate the mask to distinguish real tokens from padding
    # hint: inputs is positive for real tokens and 0 where they are padding
    mask = inputs > 0
    
    ### END CODE HERE ###
    
    # add axes to the mask for attention heads and decoder length.
    mask = fastnp.reshape(mask, (mask.shape[0], 1, 1, mask.shape[1]))
    
    # broadcast so mask shape is [batch size, attention heads, decoder-len, encoder-len].
    # note: for this assignment, attention heads is set to 1.
    mask = mask + fastnp.zeros((1, 1, decoder_activations.shape[1], 1))
        
    
    return queries, keys, values, mask
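For reference, a quick shape check of the masking steps above, using plain NumPy in place of fastnp and made-up sizes:

import numpy as np

batch_size, padded_len, d_model = 2, 6, 4
decoder_activations = np.zeros((batch_size, padded_len, d_model))
inputs = np.array([[12, 7, 45, 3, 0, 0],
                   [ 5, 9,  0, 0, 0, 0]])

mask = inputs > 0
mask = np.reshape(mask, (mask.shape[0], 1, 1, mask.shape[1]))
mask = mask + np.zeros((1, 1, decoder_activations.shape[1], 1))
print(mask.shape)   # (2, 1, 6, 6): [batch, heads, decoder-len, encoder-len]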

Hi @Mohammad_Atif_Khan

In this particular case, these lines tell Trax not to care about the "padding" (the second part of @Juan_Olano's answer).

This way we do not reward or penalize the model's weights based on whether it predicts the padding tokens correctly. A loss weight of 0 for padding tokens results in a loss of 0 for those tokens (the loss is multiplied by 0), while every other token gets a weight of 1 (whatever the loss is for those tokens, it is multiplied by 1).
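A small sketch of the idea in plain NumPy (the per-token losses are dummy values; in practice AddLossWeights adds the weights array to each batch, and Trax's loss layer applies it):

import numpy as np

targets = np.array([[14, 27, 9, 1, 0, 0]])                    # 0 = padding id
per_token_loss = np.array([[0.3, 0.7, 0.2, 0.1, 0.5, 0.4]])   # dummy values

# AddLossWeights(id_to_mask=0) adds a weights array like this to each batch:
weights = (targets != 0).astype(np.float32)   # 1 for real tokens, 0 for padding

# Only the real tokens contribute to the training loss.
masked_loss = np.sum(per_token_loss * weights) / np.sum(weights)
print(masked_loss)   # averages 0.3, 0.7, 0.2, 0.1 only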