C4W2 Exercise 7.1 Decoder Layer Error

I have implemented the class correctly as far as I can tell, but perhaps there is an issue with some of the inputs I am passing to the layers.

---------------------------------------------------------------------------
InvalidArgumentError                      Traceback (most recent call last)
Cell In[94], line 11
      8 encoder_test_output = tf.convert_to_tensor(np.random.rand(1, 7, 8))
      9 look_ahead_mask = create_look_ahead_mask(q.shape[1])
---> 11 out, attn_w_b1, attn_w_b2 = decoderLayer_test(q, encoder_test_output, False, look_ahead_mask, None)
     13 print(f"Using embedding_dim={key_dim} and num_heads={n_heads}:\n")
     14 print(f"q has shape:{q.shape}")

File /usr/local/lib/python3.8/dist-packages/keras/src/utils/traceback_utils.py:70, in filter_traceback.<locals>.error_handler(*args, **kwargs)
     67     filtered_tb = _process_traceback_frames(e.__traceback__)
     68     # To get the full stack trace, call:
     69     # `tf.debugging.disable_traceback_filtering()`
---> 70     raise e.with_traceback(filtered_tb) from None
     71 finally:
     72     del filtered_tb

Cell In[93], line 67, in DecoderLayer.call(self, x, enc_output, training, look_ahead_mask, padding_mask)
     61 Q1 = self.layernorm1(x+mult_attn_out1)
     63 # BLOCK 2
     64 # calculate self-attention using the Q from the first block and K and V from the encoder output. 
     65 # Dropout will be applied during training
     66 # Return attention scores as attn_weights_block2 (~1 line) 
---> 67 mult_attn_out2, attn_weights_block2 = scaled_dot_product_attention(Q1, enc_output, enc_output, padding_mask)
     69 # # apply layer normalization (layernorm2) to the sum of the attention output and the Q from the first block (~1 line)
     70 mult_attn_out2 = self.layernorm2(Q1+mult_attn_out2)

Cell In[65], line 23, in scaled_dot_product_attention(q, k, v, mask)
      3 """
      4 Calculate the attention weights.
      5   q, k, v must have matching leading dimensions.
   (...)
     18     output -- attention_weights
     19 """
     20 ### START CODE HERE ###
     21 
     22 # Multiply q and k transposed.
---> 23 matmul_qk = tf.matmul(q, k, transpose_b=True)
     25 # scale matmul_qk with the square root of dk
     26 dk = tf.cast(len(k), tf.float32)

InvalidArgumentError: Exception encountered when calling layer 'decoder_layer_11' (type DecoderLayer).

cannot compute BatchMatMulV2 as input #1(zero-based) was expected to be a float tensor but is a double tensor [Op:BatchMatMulV2] name: 

Call arguments received by layer 'decoder_layer_11' (type DecoderLayer):
  • x=tf.Tensor(shape=(1, 15, 12), dtype=float32)
  • enc_output=tf.Tensor(shape=(1, 7, 8), dtype=float64)
  • training=False
  • look_ahead_mask=tf.Tensor(shape=(1, 15, 15), dtype=float32)
  • padding_mask=None

This is what I get, and I am unable to figure out what is causing it. Thanks in advance for any help.

It seems to me that it has to do with the tensor dtypes (float32 vs. float64), which I do not seem to have control over.
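For what it's worth, the dtype mismatch in the traceback comes from NumPy: `np.random.rand` returns float64 by default, while Keras layers operate in float32, so `tf.convert_to_tensor` on the raw array produces a float64 `enc_output`. A minimal sketch of the mismatch and one possible workaround (casting; note this is just an illustration of the dtype issue, not the course's intended fix):

```python
import numpy as np

# np.random.rand returns float64 by default, while Keras layers
# typically compute in float32 -- mixing the two is what triggers
# the BatchMatMulV2 "float vs. double" error above.
enc = np.random.rand(1, 7, 8)
print(enc.dtype)  # float64

# One possible workaround: cast to float32 before handing the array
# to TensorFlow, so both matmul operands share a dtype.
enc32 = enc.astype(np.float32)
print(enc32.dtype)  # float32
```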

On further inspection, I found that I am not using mha1 and mha2 at all, since I am using the scaled_dot_product_attention function to get the output. I am unable to understand how to use mha1 and mha2 during the forward pass.

I’m not a mentor for that course, but I am for another course which teaches NLP methods.

In that other course, the `scaled_dot_product_attention()` function is included in the assignment only to build familiarity with the concept. It isn't intended to be used in the rest of the assignment. The mha layers are better suited in practice.
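To illustrate why (this is a standalone sketch with made-up shapes, not the assignment's code): `tf.keras.layers.MultiHeadAttention` already wraps the scaled dot-product computation, the masking, and the output projection, so the manual function is not needed inside the decoder. Note that Keras' mask convention is that a 1 marks a position that *may* be attended to:

```python
import tensorflow as tf

# MultiHeadAttention handles projection, masking, and attention scores
# in one call; attention_mask uses 1 = "allowed to attend".
mha = tf.keras.layers.MultiHeadAttention(num_heads=2, key_dim=4)
x = tf.random.uniform((1, 5, 8))
# Lower-triangular causal mask: each position sees itself and the past.
mask = tf.linalg.band_part(tf.ones((1, 5, 5)), -1, 0)
out, scores = mha(query=x, value=x, attention_mask=mask,
                  return_attention_scores=True)
print(out.shape, scores.shape)  # (1, 5, 8) (1, 2, 5, 5)
```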

Perhaps that is the situation in this course as well.


Hello, thanks a lot. I just finished the NLP specialization along with this course.


@medsharan

It is considered good practice to mention in the topic you created how you resolved your issue, so future learners who run into a similar problem find your thread helpful.

Good luck.

Hello, thanks for the reminder.

It was basically just as TMosh said: the forward pass has to involve the layers created in the constructor (mha1, mha2, and final_layer), so the input has to pass through each of these layers in turn.

What took me a long time to realize was that MultiHeadAttention is a specific type of Keras layer (much like a Dense layer), documented in the TensorFlow API reference, and there is a specific way it accepts inputs and returns outputs. Reading the documentation should resolve most doubts about this exercise.
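For anyone landing here later, a minimal sketch of how the two attention layers are meant to be wired in the decoder's forward pass (my own variable names and toy shapes, not the graded solution; I've also cast the encoder output to float32, which sidesteps the dtype error from the original post):

```python
import tensorflow as tf

# Two MultiHeadAttention layers and their layer norms, as created in
# the decoder layer's constructor (names here are illustrative).
mha1 = tf.keras.layers.MultiHeadAttention(num_heads=2, key_dim=6)
mha2 = tf.keras.layers.MultiHeadAttention(num_heads=2, key_dim=6)
layernorm1 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
layernorm2 = tf.keras.layers.LayerNormalization(epsilon=1e-6)

x = tf.random.uniform((1, 15, 12))  # decoder input, float32
# Cast the (float64) encoder output to float32 before using it.
enc_output = tf.cast(tf.random.uniform((1, 7, 12), dtype=tf.float64),
                     tf.float32)

# Block 1: self-attention -- query, key, and value are all x.
attn1, w1 = mha1(query=x, value=x, key=x, return_attention_scores=True)
q1 = layernorm1(x + attn1)

# Block 2: cross-attention -- query from block 1, key/value from
# the encoder output.
attn2, w2 = mha2(query=q1, value=enc_output, key=enc_output,
                 return_attention_scores=True)
out = layernorm2(q1 + attn2)
print(out.shape)  # (1, 15, 12)
```

The feed-forward network and final_layer would follow the same pattern: the output of one block becomes the input of the next.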

Good luck to future learners!
