C4W1 dimensional error in Exercise 4 Translator unit tests

I am having some trouble with Exercise 4: Translator. The previous exercises pass, and the basic check for Exercise 4 does as well, but it fails the unit tests with the following error:

     37 # Call the MH attention by passing in the query and value
     38 # For this case the query should be the translation and the value the encoded sentence to translate
     39 # Hint: Check the call arguments of MultiHeadAttention in the docs
---> 40 attn_output = self.mha(
     41     query=target,
     42     value=context
     43 )
     45 ### END CODE HERE ###
     47 x = self.add([target, attn_output])

InvalidArgumentError: Exception encountered when calling layer 'key' (type EinsumDense).

{{function_node __wrapped__Einsum_N_2_device_/job:localhost/replica:0/task:0/device:GPU:0}} Expected dimension 512 at axis 0 of the input shaped [256,1,256] but got dimension 256 [Op:Einsum] name: 

Call arguments received by layer 'key' (type EinsumDense):
  • inputs=tf.Tensor(shape=(64, 19, 512), dtype=float32)

I am not hardcoding the 256 anywhere; I am just using the units parameter.

Here are the dimensions at the subsequent steps:

Translator: 10000, 512
Encoder: 10000, 512
Decoder: 10000, 512
Translator Context: (64, 19)
Translator Target: (64, 17)
Encoder context: (64, 19)

Encoder embedding: (64, 19, 512)

Encoder LSTM: (64, 19, 512)

Translator: Encoded context: (64, 19, 512)
Decoder input: (64, 19, 512) (64, 17)
Decoder Embedding: (64, 17, 256)
Decoder LSTM: (64, 17, 256)

Any ideas where the issue is?

Hi @Krzysztof_Jakubczyk

These lines are from Exercise 2, which means that even though you passed the previous tests, the probable cause of this error lies in your earlier implementations of Exercises 1 and 2.

So make sure you have the correct arguments for the Exercise 1 Encoder, as per the instructions:

  • Embedding. For this layer you need to define the appropriate input_dim and output_dim and let it know that you are using ‘0’ as padding, which can be done by using the appropriate value for the mask_zero parameter.

  • Bidirectional LSTM. In TF you can implement bidirectional behaviour for RNN-like layers. This part is already taken care of but you will need to specify the appropriate type of layer as well as its parameters. In particular you need to set the appropriate number of units and make sure that the LSTM returns the full sequence and not only the last output, which can be done by using the appropriate value for the return_sequences parameter.

And make sure that you also correctly pass the input through these two layers in the call() method.
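To make the shapes concrete, here is a minimal sketch of those two Encoder layers with the dimensions from your output (vocab 10000, units 512). The variable names and merge_mode are assumptions for illustration, not the official solution:

```python
import tensorflow as tf

VOCAB_SIZE, UNITS = 10000, 512  # matching the dimensions in the question

# Embedding: 0 is treated as padding via mask_zero
embedding = tf.keras.layers.Embedding(
    input_dim=VOCAB_SIZE,
    output_dim=UNITS,
    mask_zero=True,
)

# Bidirectional LSTM that returns the full sequence.
# merge_mode="sum" keeps the output at UNITS (matching your (64, 19, 512));
# the default "concat" would double it to 1024.
rnn = tf.keras.layers.Bidirectional(
    tf.keras.layers.LSTM(units=UNITS, return_sequences=True),
    merge_mode="sum",
)

context = tf.ones((64, 19), dtype=tf.int32)  # dummy token ids
x = embedding(context)  # (64, 19, 512)
x = rnn(x)              # (64, 19, 512)
```

If the Decoder's Embedding or LSTM were instead built with 256 units, the decoder output would come out as (64, 17, 256), which is exactly the mismatch the Einsum error reports.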

The same applies to the Exercise 2 CrossAttention:

The cross attention consists of the following layers:

  • MultiHeadAttention. For this layer you need to define the appropriate key_dim, which is the size of the key and query tensors. You will also need to set the number of heads to 1 since you aren’t implementing multi head attention but attention between two tensors. The reason why this layer is preferred over Attention is that it allows simpler code during the forward pass.

A couple of things to notice:

  • You need a way to pass both the output of the attention alongside the shifted-to-the-right translation (since this cross attention happens in the decoder side). For this you will use an Add layer so that the original dimension is preserved, which would not happen if you use something like a Concatenate layer.

  • Layer normalization is also performed for better stability of the network by using a LayerNormalization layer.

  • You don’t need to worry about these last steps as these are already solved.

Note that you only need to fill in the code between ### START CODE HERE ### and ### END CODE HERE ###, and the “adding” is implemented for you, so you are not “concatenating” (which would have produced dimension 512 instead of 256 when combining the tensors). And of course, specifying the right parameters is crucial.
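A quick illustration of why Add is used here rather than Concatenate (dummy shapes chosen to mirror the decoder side):

```python
import tensorflow as tf

a = tf.zeros((64, 17, 256))
b = tf.zeros((64, 17, 256))

added = tf.keras.layers.Add()([a, b])                 # shape stays (64, 17, 256)
concatenated = tf.keras.layers.Concatenate()([a, b])  # last axis doubles to (64, 17, 512)
```

Add keeps the residual connection dimension-preserving, which is what the subsequent LayerNormalization and decoder layers expect.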

Let me know if you have doubts regarding any of these points.