I don’t know how to do this. Any tips apart from the ones they give? I’m not sure where to get the query, key, and value tensors; all we are given is x in the call function.
Here is my current attempt, which did not work:
# calculate self-attention using mha(~1 line). Dropout will be applied during training
attn_output = self.mha(x, x, x, attention_mask=mask, return_attention_scores=False, training=False) # Self attention (batch_size, input_seq_len, fully_connected_dim)
Error I get:
---------------------------------------------------------------------------
AssertionError Traceback (most recent call last)
<ipython-input-47-00617004b1af> in <module>
1 # UNIT TEST
----> 2 EncoderLayer_test(EncoderLayer)
~/work/W4A1/public_tests.py in EncoderLayer_test(target)
92 [[ 0.23017104, -0.98100424, -0.78707516, 1.5379084 ],
93 [-1.2280797 , 0.76477575, -0.7169283 , 1.1802323 ],
---> 94 [ 0.14880152, -0.48318022, -1.1908402 , 1.5252188 ]]), "Wrong values when training=True"
95
96 encoded = encoder_layer1(q, False, np.array([[1, 1, 0]]))
AssertionError: Wrong values when training=True
Tips they give:
The call arguments for self.mha are (where B is the batch size, T is the target sequence length, and S is the source sequence length):
query: Query Tensor of shape (B, T, dim).
value: Value Tensor of shape (B, S, dim).
key: Optional key Tensor of shape (B, S, dim). If not given, will use value for both key and value, which is the most common case.
attention_mask: a boolean mask of shape (B, T, S) that prevents attention to certain positions. The mask specifies which query elements can attend to which key elements: 1 indicates attention and 0 indicates no attention. Broadcasting can happen for the missing batch and head dimensions.
return_attention_scores: A boolean indicating whether the output should be (attention_output, attention_scores) if True, or just attention_output if False. Defaults to False.
training: Python boolean indicating whether the layer should behave in training mode (adding dropout) or in inference mode (no dropout). Defaults to either using the training mode of the parent layer/model, or False (inference) if there is no parent layer.
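Two observations on the attempt above. First, the failing test is "Wrong values when training=True", and the call hardcodes training=False; forwarding the training argument that EncoderLayer.call receives (i.e. training=training) is the usual fix, so that dropout is actually applied during training. Second, for intuition about what self.mha computes with an attention mask, here is a minimal NumPy sketch of masked scaled dot-product self-attention, without the learned per-head projections Keras adds (the function name and shapes are illustrative, not the assignment's API):

```python
import numpy as np

def masked_self_attention(x, mask):
    """Toy single-head self-attention: query = key = value = x.

    x:    (batch, seq_len, dim) input tensor
    mask: (batch, seq_len) with 1 = attend to this key, 0 = block it
    """
    dim = x.shape[-1]
    # Similarity scores between every query and every key: (batch, T, S)
    scores = x @ np.swapaxes(x, -1, -2) / np.sqrt(dim)
    # Broadcast the key mask over the query axis; masked keys get a large
    # negative score so softmax assigns them ~0 weight.
    scores = np.where(mask[:, None, :] == 1, scores, -1e9)
    # Numerically stable softmax over the key axis
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Weighted sum of values: (batch, T, dim)
    return weights @ x

x = np.random.default_rng(0).normal(size=(1, 3, 4))
out = masked_self_attention(x, np.array([[1, 1, 0]]))
print(out.shape)  # (1, 3, 4)
```

With mask [[1, 1, 0]], every output position is a weighted average of only the first two input positions; the third key contributes nothing, which is exactly what the (B, T, S) boolean mask achieves inside the Keras layer.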