Assignment - Mask padding before training

Why do we have to mask the padding in the loss weights of our data, using the id_to_mask argument of trax.data.inputs.add_loss_weights? We didn't do this previously.

# Create training data, mask pad id=35180 for training.
train_generator = trax.data.inputs.add_loss_weights(
    data_generator(batch_size, t_sentences, t_labels, vocab['<PAD>'], True),
    id_to_mask=vocab['<PAD>'])

# Create validation data, mask pad id=35180 for validation.
eval_generator = trax.data.inputs.add_loss_weights(
    data_generator(batch_size, v_sentences, v_labels, vocab['<PAD>'], True),
    id_to_mask=vocab['<PAD>'])
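For context, here is roughly what one batch from such a generator looks like (a toy sketch with made-up ids and sequence lengths, not the actual assignment data): the (inputs, targets) pairs become (inputs, targets, weights) triples, with weight 0 wherever the target equals the pad id.

import numpy as np
import trax

PAD_ID = 35180  # vocab['<PAD>'] in the assignment

# A toy generator that yields one (inputs, targets) batch of two padded sequences.
def toy_batches():
    x = np.array([[12, 7, 9, PAD_ID, PAD_ID],
                  [3, 4, PAD_ID, PAD_ID, PAD_ID]])
    y = np.array([[1, 2, 0, PAD_ID, PAD_ID],
                  [5, 6, PAD_ID, PAD_ID, PAD_ID]])
    while True:
        yield (x, y)

weighted = trax.data.inputs.add_loss_weights(toy_batches(), id_to_mask=PAD_ID)
inputs, targets, weights = next(weighted)
print(weights)
# [[1. 1. 1. 0. 0.]
#  [1. 1. 0. 0. 0.]]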

Regards,

From my understanding (I recently got access to the NLP specialization too), you need to mask the padding because the inputs and outputs must have a fixed length, so if a sentence is shorter than that fixed length, the positions beyond the end of the sentence are filled with the pad token.

When the time comes to calculate the loss, those pads will all be the same and they won't contribute to the overall loss.
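For example, a rough sketch of that padding step (with made-up token ids and a made-up fixed length):

import numpy as np

PAD_ID = 35180   # assumed pad token id, as in this week's assignment
MAX_LEN = 6      # assumed fixed sequence length for the batch

# Two tokenized sentences of different lengths (made-up ids).
sentences = [[12, 7, 9], [3, 4, 15, 8, 2]]

# Pad each sentence on the right with PAD_ID up to MAX_LEN.
padded = np.array([s + [PAD_ID] * (MAX_LEN - len(s)) for s in sentences])
print(padded)
# [[   12     7     9 35180 35180 35180]
#  [    3     4    15     8     2 35180]]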

Hi @gent.spah

That is true. But @Aaditya1's question was why it was not done in the previous week.

I had no time to check, but the probable cause could be that the padding token id in the previous week was 0, and maybe (this needs to be checked) the training task, somewhere down the line, assigns these positions a loss weight of 0 by default. I doubt it, though, just from glancing over the trax implementation: the default id_to_mask in add_loss_weights is None, and no loss weights means the model gets penalized for not predicting the pad tokens correctly, and it probably can get away with this.

So it could be just a mistake that we can get away with, or the default padding value of 0 is accounted for somewhere in trax.
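If it helps to see the first part concretely, here is a small sketch (with a made-up batch that uses 0 as the pad id, like the previous week) comparing the weights that add_loss_weights itself produces with and without id_to_mask:

import numpy as np
import trax

# One made-up (inputs, targets) batch where 0 is the pad id.
def batches():
    x = np.array([[5, 8, 1, 0, 0, 0]])
    y = np.array([[7, 2, 1, 0, 0, 0]])
    while True:
        yield (x, y)

# Default id_to_mask=None: add_loss_weights itself gives every position weight 1,
# so nothing at this stage excludes the padded 0s from the loss.
_, _, w_default = next(trax.data.inputs.add_loss_weights(batches()))
print(w_default)  # [[1. 1. 1. 1. 1. 1.]]

# With id_to_mask=0 the padded positions get weight 0.
_, _, w_masked = next(trax.data.inputs.add_loss_weights(batches(), id_to_mask=0))
print(w_masked)   # [[1. 1. 1. 0. 0. 0.]]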


Hi @arvyzukai,

What do you mean by 'no loss weights means the model gets penalized for not predicting the pad tokens correctly, and it probably can get away with this'? Can you explain how this penalization is related to loss weights?

Hi @Aaditya1

Let me explain with an example from the previous week's assignment. Here is a simple batch sample of 2:
[image: a batch of two sequences padded with 0s]

If the loss is not "masked out" according to the weights ([1, 1, 1, 0, 0, 0, …, 0]), then the model also has to predict each padding 0 correctly. That is not a very hard task (it's easy to predict that, after the 1, it's always 0s), so that is why I say the model training can get away with this.

But a correct implementation should account for the padded tokens and should not accumulate the loss where the mask is 0.

In other words, the training updates the model's weights indistinguishably whether the token is 49 or 0 (but in reality we care more about 49 or 50 than about the 0s).
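To make the penalization concrete, here is a hand-rolled sketch of a weighted loss (with made-up per-token cross-entropy values, not the actual trax CrossEntropyLoss code):

import numpy as np

# Made-up per-token cross-entropy values for one sequence of length 6,
# where the last three positions are padding.
per_token_ce = np.array([0.7, 0.3, 1.2, 0.01, 0.02, 0.01])

unmasked = np.array([1., 1., 1., 1., 1., 1.])  # no masking: pads count
masked   = np.array([1., 1., 1., 0., 0., 0.])  # pads masked out

def weighted_loss(ce, weights):
    # Average the loss only over positions with non-zero weight.
    return np.sum(ce * weights) / np.sum(weights)

print(weighted_loss(per_token_ce, unmasked))  # ~0.373 - easy pad predictions dilute the loss
print(weighted_loss(per_token_ce, masked))    # ~0.733 - only real tokens contribute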

P.S. Checking whether this is true takes some time; when I (or someone else) get the time, I could answer whether this is the case, i.e. whether the loss is actually calculated for padded tokens.


Thanks a lot @arvyzukai, I got an intuition of what you are trying to explain. Please let me know whenever you check whether the loss is being calculated for the padded tokens.