C4W2 Exercise 7.1 Decoder Layer Error

I have implemented the class correctly as far as I can tell, but perhaps there is an issue with some of the inputs I am passing to the layers.

---------------------------------------------------------------------------
InvalidArgumentError                      Traceback (most recent call last)
Cell In[94], line 11
      8 encoder_test_output = tf.convert_to_tensor(np.random.rand(1, 7, 8))
      9 look_ahead_mask = create_look_ahead_mask(q.shape[1])
---> 11 out, attn_w_b1, attn_w_b2 = decoderLayer_test(q, encoder_test_output, False, look_ahead_mask, None)
     13 print(f"Using embedding_dim={key_dim} and num_heads={n_heads}:\n")
     14 print(f"q has shape:{q.shape}")

File /usr/local/lib/python3.8/dist-packages/keras/src/utils/traceback_utils.py:70, in filter_traceback.<locals>.error_handler(*args, **kwargs)
     67     filtered_tb = _process_traceback_frames(e.__traceback__)
     68     # To get the full stack trace, call:
     69     # `tf.debugging.disable_traceback_filtering()`
---> 70     raise e.with_traceback(filtered_tb) from None
     71 finally:
     72     del filtered_tb

Cell In[93], line 67, in DecoderLayer.call(self, x, enc_output, training, look_ahead_mask, padding_mask)
     61 Q1 = self.layernorm1(x+mult_attn_out1)
     63 # BLOCK 2
     64 # calculate self-attention using the Q from the first block and K and V from the encoder output. 
     65 # Dropout will be applied during training
     66 # Return attention scores as attn_weights_block2 (~1 line) 
---> 67 mult_attn_out2, attn_weights_block2 = scaled_dot_product_attention(Q1, enc_output, enc_output, padding_mask)
     69 # # apply layer normalization (layernorm2) to the sum of the attention output and the Q from the first block (~1 line)
     70 mult_attn_out2 = self.layernorm2(Q1+mult_attn_out2)

Cell In[65], line 23, in scaled_dot_product_attention(q, k, v, mask)
      3 """
      4 Calculate the attention weights.
      5   q, k, v must have matching leading dimensions.
   (...)
     18     output -- attention_weights
     19 """
     20 ### START CODE HERE ###
     21 
     22 # Multiply q and k transposed.
---> 23 matmul_qk = tf.matmul(q, k, transpose_b=True)
     25 # scale matmul_qk with the square root of dk
     26 dk = tf.cast(len(k), tf.float32)

InvalidArgumentError: Exception encountered when calling layer 'decoder_layer_11' (type DecoderLayer).

cannot compute BatchMatMulV2 as input #1(zero-based) was expected to be a float tensor but is a double tensor [Op:BatchMatMulV2] name: 

Call arguments received by layer 'decoder_layer_11' (type DecoderLayer):
  • x=tf.Tensor(shape=(1, 15, 12), dtype=float32)
  • enc_output=tf.Tensor(shape=(1, 7, 8), dtype=float64)
  • training=False
  • look_ahead_mask=tf.Tensor(shape=(1, 15, 15), dtype=float32)
  • padding_mask=None

This is what I get, and I am unable to figure out what is causing it. Thanks in advance for any help.

It seems to me that it has to do with the tensor dtypes (float32 vs. float64), which I do not seem to have control over.
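For what it's worth, the dtype mismatch in the traceback comes from NumPy: `np.random.rand` returns float64 by default, while Keras layers operate in float32, so `tf.convert_to_tensor` on the raw array produces a float64 `enc_output`. A minimal sketch of the mismatch and one possible workaround (casting; note this is just an illustration of the dtype issue, not the course's intended fix):

```python
import numpy as np

# np.random.rand returns float64 by default, while Keras layers
# typically compute in float32 -- mixing the two is what triggers
# the BatchMatMulV2 "float vs. double" error above.
enc = np.random.rand(1, 7, 8)
print(enc.dtype)  # float64

# One possible workaround: cast to float32 before handing the array
# to TensorFlow, so both matmul operands share a dtype.
enc32 = enc.astype(np.float32)
print(enc32.dtype)  # float32
```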

On further inspection, I found that I am not using mha1 and mha2 at all, since I am using the scaled_dot_product_attention function to get the output. I am unable to understand how to use mha1 and mha2 during the forward pass.

I’m not a mentor for that course, but I am for another course which teaches NLP methods.

In that other course, the `scaled_dot_product_attention()` function is included in the assignment only to build familiarity with the concept. It isn't intended to be used in the rest of the assignment. The mha layers are better suited in practice.
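To illustrate why (this is a standalone sketch with made-up shapes, not the assignment's code): `tf.keras.layers.MultiHeadAttention` already wraps the scaled dot-product computation, the masking, and the output projection, so the manual function is not needed inside the decoder. Note that Keras' mask convention is that a 1 marks a position that *may* be attended to:

```python
import tensorflow as tf

# MultiHeadAttention handles projection, masking, and attention scores
# in one call; attention_mask uses 1 = "allowed to attend".
mha = tf.keras.layers.MultiHeadAttention(num_heads=2, key_dim=4)
x = tf.random.uniform((1, 5, 8))
# Lower-triangular causal mask: each position sees itself and the past.
mask = tf.linalg.band_part(tf.ones((1, 5, 5)), -1, 0)
out, scores = mha(query=x, value=x, attention_mask=mask,
                  return_attention_scores=True)
print(out.shape, scores.shape)  # (1, 5, 8) (1, 2, 5, 5)
```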

Perhaps that is the situation in this course as well.


Hello, thanks a lot. I just finished the NLP specialization along with this course.


@medsharan

It is considered good practice to mention in the topic you created how you resolved your issue, so future learners who run into a similar problem find your thread helpful.

Good luck.

Hello, thanks for the reminder.

It was basically just as TMosh said: the forward pass has to involve the layers created in the constructor (mha1, mha2, and final_layer), so the input has to pass through each of these layers in turn.

What took me a long time to realize was that MultiHeadAttention is a specific type of Keras layer (much like a Dense layer), documented in the TensorFlow API reference, and there is a specific way it accepts inputs and returns outputs. Reading the documentation should resolve most doubts about this exercise.
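For anyone landing here later, a minimal sketch of how the two attention layers are meant to be wired in the decoder's forward pass (my own variable names and toy shapes, not the graded solution; I've also cast the encoder output to float32, which sidesteps the dtype error from the original post):

```python
import tensorflow as tf

# Two MultiHeadAttention layers and their layer norms, as created in
# the decoder layer's constructor (names here are illustrative).
mha1 = tf.keras.layers.MultiHeadAttention(num_heads=2, key_dim=6)
mha2 = tf.keras.layers.MultiHeadAttention(num_heads=2, key_dim=6)
layernorm1 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
layernorm2 = tf.keras.layers.LayerNormalization(epsilon=1e-6)

x = tf.random.uniform((1, 15, 12))  # decoder input, float32
# Cast the (float64) encoder output to float32 before using it.
enc_output = tf.cast(tf.random.uniform((1, 7, 12), dtype=tf.float64),
                     tf.float32)

# Block 1: self-attention -- query, key, and value are all x.
attn1, w1 = mha1(query=x, value=x, key=x, return_attention_scores=True)
q1 = layernorm1(x + attn1)

# Block 2: cross-attention -- query from block 1, key/value from
# the encoder output.
attn2, w2 = mha2(query=q1, value=enc_output, key=enc_output,
                 return_attention_scores=True)
out = layernorm2(q1 + attn2)
print(out.shape)  # (1, 15, 12)
```

The feed-forward network and final_layer would follow the same pattern: the output of one block becomes the input of the next.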

Good luck to future learners!
