Attention Mask Failure

Hi Mentor,

Below is my code. I'm trying to understand what the dimensions of attention_weights are when computing a decoder layer. I'm trying to mask/hide the attention scores for positions 6-10 in the sequence of words, keeping only the first five attention scores, in order to predict the 6th word in the target. Can you please help me see what mistake I'm making here?

import tensorflow as tf
import numpy as np
from tensorflow.keras.layers import Embedding, MultiHeadAttention, Dense, Input, Dropout, LayerNormalization
layer = MultiHeadAttention(num_heads=8, key_dim=6, dropout=0.1)
x1 = np.random.rand(1, 10, 12)
x2 = np.random.rand(1, 10, 12)
x3 = np.random.rand(1, 10, 12)
x4 = np.random.rand(1, 10, 5)
attention_output, weights = layer(query=x1, key=x2, value=x3, attention_mask=x4,
                                  training=True, return_attention_scores=True)
print(weights.shape)

Hi Anbu,

You can have a look at this link. The attention_mask argument must be a boolean mask broadcastable to (batch_size, query_seq_len, key_seq_len), so corrected code, with the dimensions of the call arguments made consistent, would be the following:

import tensorflow as tf
import numpy as np
from tensorflow.keras.layers import Embedding, MultiHeadAttention, Dense, Input, Dropout, LayerNormalization
layer = MultiHeadAttention(num_heads=8, key_dim=6, dropout=0.1)
x1 = np.random.rand(1, 10, 6)    # query: (batch_size, query_seq_len, dim)
x2 = np.random.rand(1, 10, 6)    # key:   (batch_size, key_seq_len, dim)
x3 = np.random.rand(1, 10, 6)    # value: (batch_size, key_seq_len, dim)
x4 = np.random.rand(1, 10, 10)   # attention_mask: (batch_size, query_seq_len, key_seq_len)
attention_output, weights = layer(query=x1, key=x2, value=x3, attention_mask=x4,
                                  training=True, return_attention_scores=True)
print(weights.shape)
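This prints (1, 8, 10, 10), i.e. (batch_size, num_heads, query_seq_len, key_seq_len), which is the shape of the attention_weights you asked about.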

Anything you want to mask out you have to do through the values in the mask (1/True means the query may attend to that key position, 0/False means it may not), not through its dimensions.
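
For instance, if you want every query position (including the one for the 6th target word) to attend only to the first five key positions and ignore positions 6-10, you can build the mask explicitly. This is only a minimal sketch; the variable names (x, mask, causal_mask) are just illustrative:

import numpy as np
import tensorflow as tf
from tensorflow.keras.layers import MultiHeadAttention

layer = MultiHeadAttention(num_heads=8, key_dim=6, dropout=0.1)
x = np.random.rand(1, 10, 6)

# Boolean mask of shape (batch_size, query_seq_len, key_seq_len):
# True (1) = the query may attend to that key position, False (0) = it may not.
mask = np.zeros((1, 10, 10), dtype=bool)
mask[:, :, :5] = True   # keep only the first five key positions visible

attention_output, weights = layer(query=x, key=x, value=x,
                                  attention_mask=mask,
                                  return_attention_scores=True)
print(weights[0, 0, 5])  # weights for the 6th query position: ~0 at key positions 6-10

# In a decoder you would typically use a causal (lower-triangular) mask instead,
# so that position i can only attend to positions 1..i:
causal_mask = np.tril(np.ones((1, 10, 10), dtype=bool))

Note that the shape of weights stays (1, 8, 10, 10) either way; only the values in the masked columns change.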