C5 W4 A1 E4 - how to use self.mha

I don’t know how to do this. Any tips apart from the ones they give? I don’t know where to get the query, value, and key tensors; all we are given is x in the call function.

My current attempt, which did not work:

        # calculate self-attention using mha(~1 line). Dropout will be applied during training
        attn_output = self.mha(x, x, x, attention_mask=mask, return_attention_scores=False, training=False) # Self attention (batch_size, input_seq_len, fully_connected_dim)

Error I get:

```
AssertionError                            Traceback (most recent call last)
<ipython-input-47-00617004b1af> in <module>
      1 # UNIT TEST
----> 2 EncoderLayer_test(EncoderLayer)

~/work/W4A1/public_tests.py in EncoderLayer_test(target)
     92                        [[ 0.23017104, -0.98100424, -0.78707516,  1.5379084 ],
     93                        [-1.2280797 ,  0.76477575, -0.7169283 ,  1.1802323 ],
---> 94                        [ 0.14880152, -0.48318022, -1.1908402 ,  1.5252188 ]]), "Wrong values when training=True"
     96     encoded = encoder_layer1(q, False, np.array([[1, 1, 0]]))

AssertionError: Wrong values when training=True
```

Tips they give:

The call arguments for self.mha are (where B is the batch size, T is the target sequence length, and S is the source sequence length):
query: Query Tensor of shape (B, T, dim).
value: Value Tensor of shape (B, S, dim).
key: Optional key Tensor of shape (B, S, dim). If not given, will use value for both key and value, which is the most common case.
attention_mask: a boolean mask of shape (B, T, S), that prevents attention to certain positions. The boolean mask specifies which query elements can attend to which key elements, 1 indicates attention and 0 indicates no attention. Broadcasting can happen for the missing batch dimensions and the head dimension.
return_attention_scores: A boolean to indicate whether the output should be (attention_output, attention_scores) if True, or attention_output if False. Defaults to False.
training: Python boolean indicating whether the layer should behave in training mode (adding dropout) or in inference mode (no dropout). Defaults to either using the training mode of the parent layer/model, or False (inference) if there is no parent layer.
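
To make the attention_mask description more concrete, here is a small NumPy sketch of how a 1/0 mask could be applied to attention scores. It is not the Keras implementation: `masked_softmax` is a made-up helper, the all-zero scores are just for illustration, and expanding the padding mask to shape (B, 1, S) is an assumption about how the broadcasting over T works out.

```python
import numpy as np

def masked_softmax(scores, attention_mask):
    """Softmax over the last axis, ignoring positions where the mask is 0.

    scores:         (B, T, S) raw attention scores
    attention_mask: broadcastable to (B, T, S); 1 = attend, 0 = do not attend
    (Hypothetical helper for illustration, not part of any library.)
    """
    # Push masked scores to a large negative value so they get ~0 weight.
    scores = np.where(attention_mask.astype(bool), scores, -1e9)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

B, T, S = 1, 3, 3
scores = np.zeros((B, T, S))            # uniform scores, for illustration only
padding_mask = np.array([[1, 1, 0]])    # shape (B, S): last key is padding
mask = padding_mask[:, np.newaxis, :]   # shape (B, 1, S), broadcasts over T

weights = masked_softmax(scores, mask)
print(weights[0])  # each row is [0.5, 0.5, 0.0]
```

Every query position splits its attention between the first two keys and gives the masked third key zero weight.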

For self-attention, Q, K, and V are all the ‘x’ matrix.
So try (x, x, x, mask).

You don’t need any other arguments. “Dropout will be applied during training” means that it happens in a different line of code, not in this one.
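
If it helps to see why passing x three times counts as self-attention, here is a rough NumPy sketch. `toy_attention` is a hypothetical single-head helper with no learned weights and no dropout; it only mirrors the (query, value, key, attention_mask) argument order from the hints and is not what the Keras layer does internally. The input shape and the mask values are assumptions for the example.

```python
import numpy as np

def toy_attention(query, value, key=None, attention_mask=None):
    """Single-head scaled dot-product attention with no learned weights.

    Hypothetical helper: it only mirrors the (query, value, key,
    attention_mask) argument order described in the hints.
    """
    if key is None:
        key = value  # same fallback as in the hints: key defaults to value
    dim = query.shape[-1]
    # (B, T, dim) @ (B, dim, S) -> (B, T, S)
    scores = query @ np.swapaxes(key, -1, -2) / np.sqrt(dim)
    if attention_mask is not None:
        # 1 = attend, 0 = do not attend
        scores = np.where(np.asarray(attention_mask, dtype=bool), scores, -1e9)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ value  # (B, T, S) @ (B, S, dim) -> (B, T, dim)

rng = np.random.default_rng(0)
x = rng.normal(size=(1, 3, 4))   # (batch, seq_len, dim), made-up input
mask = np.array([[[1, 1, 0]]])   # (B, 1, S): last position is masked out

# Self-attention: the same tensor plays the role of query, value, and key.
out = toy_attention(x, x, x, mask)
print(out.shape)  # (1, 3, 4)
```

The output has the same shape as x, because each position in x is replaced by a weighted mix of the unmasked positions of that same x.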