Hi Mentor,

Below is my code. I'm trying to understand the dimensions of the attention_weights returned when computing a decoder layer. I'm trying to mask/hide the attention scores for positions 6-10 in the sequence of words, keeping only the first five attention scores, in order to predict the 6th word in the target. Can you please help me see what mistake I'm making here?

import tensorflow as tf
import numpy as np
from tensorflow.keras.layers import Embedding, MultiHeadAttention, Dense, Input, Dropout, LayerNormalization

layer = MultiHeadAttention(num_heads=8,
                           key_dim=6,
                           dropout=0.1)

x1 = np.random.rand(1, 10, 12)  # query: (batch, seq_len, dim)
x2 = np.random.rand(1, 10, 12)  # key
x3 = np.random.rand(1, 10, 12)  # value
x4 = np.random.rand(1, 10, 5)   # my attempt at an attention mask

attention_output, weights = layer(query=x1, key=x2, value=x3,
                                  attention_mask=x4,
                                  training=True,
                                  return_attention_scores=True)
print(weights.shape)
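From reading the Keras docs, my understanding is that attention_mask has to broadcast against the attention-score tensor of shape (batch, num_heads, T_query, T_key), so for my 10-step sequence I think it should be a boolean array of shape (1, 10, 10) rather than (1, 10, 5). Here is my attempt at a mask that keeps only the first five key positions (True = attend, False = hide) — please correct me if this is the wrong approach:

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras.layers import MultiHeadAttention

layer = MultiHeadAttention(num_heads=8, key_dim=6, dropout=0.1)

x1 = np.random.rand(1, 10, 12)  # query: (batch, seq_len, dim)
x2 = np.random.rand(1, 10, 12)  # key
x3 = np.random.rand(1, 10, 12)  # value

# Boolean mask broadcastable to (batch, T_query, T_key).
# Every query position may only attend to the first five key positions.
mask = np.zeros((1, 10, 10), dtype=bool)
mask[:, :, :5] = True

attention_output, weights = layer(query=x1, key=x2, value=x3,
                                  attention_mask=mask,
                                  training=False,
                                  return_attention_scores=True)

# Scores come back per head: (batch, num_heads, T_query, T_key)
print(weights.shape)  # (1, 8, 10, 10)
```

If that is right, then the masked key positions 6-10 should receive (near-)zero attention weight for every query, which seems to match what I'm trying to do.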