C5_W4_A1_transformer_subclass questions about padding and create_padding_mask

Hello all,

One of the already implemented functions is create_padding_mask, which “Creates a matrix mask for the padding cells”. Here are my questions about it:

  1. The notebook states:

After masking, your input should go from [87, 600, 0, 0, 0] to [87, 600, -1e9, -1e9, -1e9], so that when you take the softmax, the zeros don’t affect the score.

Why do the 0 entries turn into 1e-9 for the softmax? The operation softmax applies is exp(x), so in principle it could just take exp(0). It’s not clear to me why replacing the zeros with 1e-9 helps at all.

  2. create_padding_mask flags the zero entries, since those actually represent padding rather than real words in the sentence. Replacing 0.0 with anything else feels even stranger now: if those entries are flagged and won’t be used for the softmax, why bother transforming them into 1e-9, or anything else for that matter?

  3. Why are the output dims of create_padding_mask not the same as the input’s?
    The output is (3, 3, 5) from a (3, 5) input (see the snippet below). Additionally, I can’t make sense of what this expanded tensor represents: when inspecting the 2nd tensor, I can see the first and second rows match the zero-flagging of the zero entries, but the third matrix (col 3) looks like it should have the softmax value for “3.0”.
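
For reference, here is roughly how I reproduced this (my own paraphrase of the notebook’s demo, so the exact code and values may differ):

```python
import tensorflow as tf

# My paraphrase of the notebook's demo -- exact details may differ.
x = tf.constant([[7., 6., 0., 0., 1.],
                 [1., 2., 3., 0., 0.],
                 [0., 0., 0., 4., 5.]])   # (3, 5), with 0.0 as padding

# flag the padding entries and add the trivial extra dimension, as the helper does
mask = tf.cast(tf.math.equal(x, 0.), tf.float32)[:, tf.newaxis, :]  # (3, 1, 5)

out = tf.keras.activations.softmax(x + mask * -1e9)
print(out.shape)  # (3, 3, 5) -- this is the expansion I can't explain
```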

The key point is that it’s not 1e-9, it’s -1e9. Think about the difference: -1e9 is a huge negative number. In normal mathematical notation that is -1 × 10^9 = -1,000,000,000, or minus one billion. Or, as they comment in the notebook, you can think of it as -∞.

Now think about how e^x works.

e^0 = 1
e^{-1} = 0.3678...
e^{-10} = 0.0000453...
e^{-100} = 3.72 × 10^{-44}

And so forth. So you can see that e to the power of negative one billion will be a very, very small number; I’m sure it rounds to 0 in floating point. That’s why this does exactly what they stated as the goal: it makes those entries disappear when you compute the softmax of the inputs.
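
Here’s a minimal sketch of that, deliberately using small values rather than the notebook’s (with 87 and 600 the real entries dominate so completely that the difference is invisible):

```python
import tensorflow as tf

# Small example values (not the notebook's) so the effect is visible.
row = tf.constant([[1., 2., 0., 0., 0.]])           # last three entries are padding
mask = tf.cast(tf.math.equal(row, 0.), tf.float32)  # 1.0 at the padding positions

print(tf.keras.activations.softmax(row))
# ~[0.207, 0.564, 0.076, 0.076, 0.076] -- each e^0 = 1 still grabs probability mass

print(tf.keras.activations.softmax(row + mask * -1e9))
# ~[0.269, 0.731, 0.0, 0.0, 0.0] -- e^{-1e9} underflows to 0, padding gets nothing
```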

They aren’t flagged separately, and that’s the point of the above: you could think of adding -1e9 as how they flag them.

They comment in the notebook that they add the trivial second dimension to the mask so that it can be broadcast with 3D tensors. I agree that it’s pretty confusing to sort out the meaning of the output in terms of the three dimensions. I’m trying to parse that now and will add more comments if I can figure anything out about that.
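
One shape fact I can confirm in the meantime (a sketch, assuming the demo adds the (3, 5) input directly to the (3, 1, 5) mask): standard broadcasting alone would produce the (3, 3, 5) you saw.

```python
import tensorflow as tf

x = tf.zeros((3, 5))        # stand-in for the (3, 5) example input
mask = tf.zeros((3, 1, 5))  # the mask with its trivial second dimension

# Broadcasting aligns shapes from the right: (3, 5) is treated as (1, 3, 5),
# so (1, 3, 5) + (3, 1, 5) expands to (3, 3, 5).
print((x + mask).shape)     # (3, 3, 5)
```

If that’s what’s happening, the i-th matrix in the output would be the whole (3, 5) input with the i-th example’s padding pattern applied to every row, which might be why some rows look unmasked to you.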


Hi @paulinpaloalto,
I missed the location of the minus sign! Thanks for the clarification; it makes a lot of sense now. So after replacing the 0s with -1e9, we can safely include the padding entries in the softmax as well, without having them skew probability away from the real terms.

Regarding the last point (the dimensionality analysis): I’m guessing it would make more sense for the notebook to provide a 3D tensor as the example, then? I’m asking because once we start with a 2D tensor and the extra dimension comes in, it seems like something is off.

Thanks a lot for the interesting discussion. If you have any additional info about the dimensions in create_padding_mask, I would love to hear it!

Cheers,
