C5 W4 A1 Ex-3 Questions (scaled_dot_product_attention)

I’ve gone through the forum and it looks like this last assignment has been difficult for everyone.

I’m currently stuck on Exercise 3, building the “scaled_dot_product_attention” function.

  1. Based on the exercise comments above the problem, which are wildly short compared to previous assignments, it looks like the mask “M” should just be added if it is not None, is that the intent? I’m assuming that whatever padding or look-ahead masking defines it has already happened before it gets passed in, yes?

  2. The next hint in the comments refers to “Multiply (1. - mask) by -1e9 before applying the softmax.”, which is confusing coming out of nowhere like that. This should only be used if the mask is not None, correct? So you just multiply (1 - mask) by -1e9, and then what do you do with that value? Do you also multiply it against “scaled_attention_logits”?

  3. For the first time the “softmax” function is not defined for us, and there is no numpy version of softmax, so I assume we are using the “tf.keras.activations.softmax” function referenced in one of the earlier code examples, is that right? If so, any hints on using it here, beyond what is said in the documentation?

Thanks for any help. I am already finding this as difficult as everyone else has said, and I’m only 2 hours in.

Hi @Adam_Moses ,

Oh yes, I remember how hard this one was.

  1. The mask should only be used if it is not None, so yes, you are correct that this is the intent. Applying it when it is None would give an error.

  2. Yes, this “Multiply (1. - mask) by -1e9” should only be applied if the mask is not None, as the provided conditional shows. You asked what to do with the value, and whether it should be ‘multiplied’ against scaled_attention_logits. Well, the intent here is not to multiply. If you look at the diagram, you’ll find that there is a better operation to apply.

  3. You are correct, the softmax to be used here is the one you mention (tf.keras.activations.softmax). It takes 2 parameters: the first comes from your question above, and the second is the appropriate axis of that first parameter (see the sketch just after this list).
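Putting those hints together, here is a minimal sketch of just the masking-and-softmax step, assuming the mask convention described later in this thread (1s at real tokens, 0s at padding). The variable and function names are only for illustration and may not match the notebook’s template:

```python
import tensorflow as tf

# Minimal sketch of the mask + softmax step only (illustrative names,
# not the notebook's official template).
def masked_softmax(scaled_attention_logits, mask=None):
    if mask is not None:
        # Add (not multiply) the shifted mask, so masked-out positions
        # are pushed toward negative infinity before the softmax.
        scaled_attention_logits += (1.0 - mask) * -1e9
    # Softmax over the last axis of the logits (the seq_len_k axis).
    return tf.keras.activations.softmax(scaled_attention_logits, axis=-1)
```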

@Adam_Moses please try these hints. If it still doesn’t help, I’ll be happy to ‘uncover’ more hints :slight_smile:

Thanks,

Juan

Juan, thanks, per question 2, I still don’t understand what the “(1.0 - mask) * -1e9” is for, and I get you would only do this if the mask is not None.

I get that the mask is added to the scaled_attention_logits, but are you saying you’re not supposed to add the mask by itself, and instead add that “(1.0 - mask) * -1e9” value?

@Adam_Moses, sometimes the input sequence is shorter than the maximum length, so zeroes are padded onto the end of it, and those padded positions can throw off the softmax. By multiplying (1 - mask) by -1e9 and adding that to the logits, you send those padded positions toward negative infinity, so the softmax gives them essentially no weight.

Regarding (1 - mask) vs just mask:

Look at the create_padding_mask exercise in the same lab. Check out how the mask is being created. After running this function, you will get a mask matrix that is setting 1s where there’s a non-zero, and 0s where there’s a 0.

If you applied the mask as-is (multiplied it by -1e9 and added it to the scaled attention logits), the -1e9 would land on the real-token positions (the 1s), while the padded positions (the 0s) would have nothing added… so the softmax would still give weight to the padding, which would throw off the calculation.

By applying (1 - mask), you are basically converting the 1s to 0s and the 0s to 1s in the mask. When you add (1 - mask) * -1e9 to the scaled attention logits, the -1e9 now lands on the padded positions, pushing them toward negative infinity and letting the softmax produce a proper calculation.
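To make the flip concrete, here is a tiny self-contained example, assuming a mask built the way described above (1s at real tokens, 0s at padding); the numbers are made up purely for illustration:

```python
import tensorflow as tf

# Toy sequence with two padded positions at the end (the zeros).
seq = tf.constant([[7., 6., 0., 0.]])

# Mask as described above: 1 where there's a real token, 0 where there's padding.
mask = tf.cast(tf.math.not_equal(seq, 0), tf.float32)   # [[1., 1., 0., 0.]]

logits = tf.constant([[2.0, 1.0, 0.0, 0.0]])             # pretend attention logits

# Wrong way: the -1e9 lands on the REAL tokens (the 1s), not on the padding.
wrong = logits + mask * -1e9
print(tf.keras.activations.softmax(wrong, axis=-1))
# -> roughly [[0., 0., 0.5, 0.5]]  (all the weight goes to the padding!)

# Right way: (1 - mask) flips the 1s and 0s, so the padding goes to -1e9.
right = logits + (1.0 - mask) * -1e9
print(tf.keras.activations.softmax(right, axis=-1))
# -> roughly [[0.73, 0.27, 0., 0.]]  (the padding gets essentially zero weight)
```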

Hope this sheds some light on this part of your question.

Juan

Okay, I think I follow this.

So you still apply the (1 - mask) to the softmax even if the mask is None then? Which would reverse some of the values of the softmax, is that right?

You only apply (1-mask) if there’s a mask. If the mask is None, then you don’t want to apply the mask.