C5W4 Questions after finishing the course

So, I have just finished the course and have a couple of questions.

  1. In programming exercise A1, I am confused about why the input_sentence and output_sentence arguments of the Transformer class should have shapes like this:
Arguments:
            input_sentence -- Tensor of shape (batch_size, input_seq_len, embedding_dim)
                              An array of the indexes of the words in the input sentence
            output_sentence -- Tensor of shape (batch_size, target_seq_len, embedding_dim)
                              An array of the indexes of the words in the output sentence

Shouldn’t they be 2D tensors instead, according to the Encoder class’s call method?

        """
        Forward pass for the Encoder
        
        Arguments:
            x -- Tensor of shape (batch_size, input_seq_len)

Because I think the batch of inputs is only “encoded” (embedded) once it gets inside the encoder.
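For example, this is the kind of input I would expect the Transformer to receive (a quick sketch with made-up sizes, not the assignment code):

import tensorflow as tf

# Made-up sizes for illustration only
batch_size, input_seq_len, target_seq_len, vocab_size = 64, 10, 7, 8000

# Integer word indexes, shape (batch_size, seq_len) -- no embedding_dim yet
input_sentence = tf.random.uniform((batch_size, input_seq_len), maxval=vocab_size, dtype=tf.int32)
output_sentence = tf.random.uniform((batch_size, target_seq_len), maxval=vocab_size, dtype=tf.int32)
print(input_sentence.shape, output_sentence.shape)  # (64, 10) (64, 7)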

  2. In the DecoderLayer class, why should there be a padding_mask for self.mha2?
# BLOCK 2
        # calculate self-attention using the Q from the first block and K and V from the encoder output. 
        # Dropout will be applied during training
        # Return attention scores as attn_weights_block2 (~1 line) 
        mult_attn_out2, attn_weights_block2 = self.mha2(query=####, 
                                                        value=####, 
                                                        key=####,
                                                        attention_mask=####, 
                                                        return_attention_scores=####)  
                                                        # (batch_size, target_seq_len, embedding_dim)

The shapes of Q, K, and V are not the same, are they?

This is a really good course.

Posting solution code in a public topic is discouraged and can get your account suspended. It’s okay to share a stack trace in a public post and to send code to a mentor via direct message. Please clean up the post.
Here’s the community user guide to get started.

  1. The 0th dimension is always the batch size, to make better use of the hardware. Please go back through the rest of the labs if you have missed this detail. The entire batch of data is encoded ahead of encoding / decoding.
  2. The attention mask tells the model which positions to pay attention to. We don’t want to pay attention to padding tokens, hence the need for the padding mask (see the toy example after this list).
  3. See this in the DecoderLayer#call docstring:
        x -- Tensor of shape (batch_size, target_seq_len, embedding_dim)
       enc_output --  Tensor of shape(batch_size, input_seq_len, embedding_dim)
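Here is a toy example with tf.keras.layers.MultiHeadAttention (made-up sizes, not the assignment code) showing why the padding mask in block 2 follows the encoder’s sequence length even though Q and K/V have different shapes:

import tensorflow as tf

# Made-up sizes for illustration only
batch_size, target_seq_len, input_seq_len, embedding_dim, num_heads = 64, 7, 10, 128, 8

x = tf.random.normal((batch_size, target_seq_len, embedding_dim))          # decoder side (queries)
enc_output = tf.random.normal((batch_size, input_seq_len, embedding_dim))  # encoder side (keys/values)

# 1 = attend, 0 = padding position in the *source* sentence;
# shape (batch_size, 1, input_seq_len) broadcasts over heads and query positions
dec_padding_mask = tf.ones((batch_size, 1, input_seq_len))

mha2 = tf.keras.layers.MultiHeadAttention(num_heads=num_heads, key_dim=embedding_dim)
out, attn_weights = mha2(query=x, value=enc_output, key=enc_output,
                         attention_mask=dec_padding_mask,
                         return_attention_scores=True)
print(out.shape)           # (64, 7, 128)  -> (batch_size, target_seq_len, embedding_dim)
print(attn_weights.shape)  # (64, 8, 7, 10) -> (batch_size, num_heads, target_seq_len, input_seq_len)

The last axis of both the mask and the attention weights is input_seq_len, so the padding being masked out is the padding of the source sentence fed to the encoder.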

I have corrected the post, sorry.

Back to your answer No. 3. Sorry, I was talking about the block-level class, not the layer-level class. Let’s focus on the Encoder class first. You can see that the docstring of its call method looks like this:

"""
        Forward pass for the Encoder
        
        Arguments:
            x -- Tensor of shape (batch_size, input_seq_len)
            training -- Boolean, set to true to activate
                        the training mode for dropout layers
            mask -- Boolean mask to ensure that the padding is not 
                    treated as part of the input
        Returns:
            out2 -- Tensor of shape (batch_size, input_seq_len, embedding_dim)
        """

The x there is 2D, though. That’s why I am confused.

Thanks for clarifying. The Encoder class performs the embedding inside its call method, so the input to the call method should be 2D, i.e. (batch_size, seq_len). See this as well.
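A stripped-down sketch of that pattern (my own simplification with made-up sizes, not the assignment code):

import tensorflow as tf

class TinyEncoder(tf.keras.layers.Layer):
    """Minimal stand-in to show where the embedding happens."""
    def __init__(self, vocab_size, embedding_dim):
        super().__init__()
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)

    def call(self, x):
        # x: (batch_size, input_seq_len) integer word indexes
        return self.embedding(x)  # (batch_size, input_seq_len, embedding_dim)

enc = TinyEncoder(vocab_size=8000, embedding_dim=128)
out = enc(tf.constant([[5, 42, 7, 0, 0]]))  # one padded sentence of length 5
print(out.shape)  # (1, 5, 128)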

Yes, sir. That is exactly why I am asking why the docstring of the Transformer class’s call method states that input_sentence and output_sentence should have a 3D shape. I don’t think it makes sense to embed an already embedded tensor.

"""
        Forward pass for the entire Transformer
        Arguments:
            input_sentence -- Tensor of shape (batch_size, input_seq_len, embedding_dim)
                              An array of the indexes of the words in the input sentence
            output_sentence -- Tensor of shape (batch_size, target_seq_len, embedding_dim)
                              An array of the indexes of the words in the output sentence
            training -- Boolean, set to true to activate
                        the training mode for dropout layers
            enc_padding_mask -- Boolean mask to ensure that the padding is not 
                    treated as part of the input
            look_ahead_mask -- Boolean mask for the target_input
            dec_padding_mask -- Boolean mask for the second multihead attention layer
        Returns:
            final_output -- Describe me
            attention_weights - Dictionary of tensors containing all the attention weights for the decoder
                                each of shape Tensor of shape (batch_size, num_heads, target_seq_len, input_seq_len)
        
        """

Are these errors, or did I miss something?

You are correct. These lines in the docstring are incorrect.
The staff have been notified to fix the mistake.
Thank you