Query Input Last Dimension

Dear Mentor,

In the context below, what does E mean?

attention_output: The result of the computation, of shape (B, T, E), where T is for target sequence shapes and E is the query input last dimension if output_shape is None. Otherwise, the multi-head outputs are projected to the shape specified by output_shape.

Dear Mentor, can someone please help answer this?

Hi @Anbu

Please state the week number, assignment name, and exercise number you are referring to so that we can understand the context.

Jaime

So I think I found the section you are referring to: C5 W4 “Programming Assignment: Transformers Architecture with TensorFlow”, E6 “Decoder Layer”.

Specifically, your issue is on Block 2, where the notebook says:

        # BLOCK 2
        # calculate self-attention using the Q from the first block and K and V from the encoder output. 
        # Dropout will be applied during training
        # Return attention scores as attn_weights_block2 (~1 line) 
        mult_attn_out2, attn_weights_block2 = self.mha2(None, None, None, None, return_attention_scores=True)  # (batch_size, target_seq_len, d_model)
        
        # apply layer normalization (layernorm2) to the sum of the attention output and the output of the first block (~1 line)
        mult_attn_out2 = None  # (batch_size, target_seq_len, fully_connected_dim)

And you got that phrase from the TensorFlow docs: tf.keras.layers.MultiHeadAttention  |  TensorFlow Core v2.8.0

Reading the TensorFlow docs, I have come to the answer:

E refers to “the query input last dimension”

If you look at the call arguments, ‘dim’ is the query last dimension:
query: Query Tensor of shape (B, T, ***dim***).

This means that, in Block 1 of the assignment in Exercise 6, whatever Q1 outputs, the last dimension of that output will be E.
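
If it helps, here is a minimal sketch in plain TensorFlow (not the assignment code; the num_heads, key_dim and tensor sizes are made-up values just for illustration) showing that, with output_shape left as None, the last dimension of attention_output matches the query's last dimension:

    import tensorflow as tf

    # Arbitrary illustrative sizes, not taken from the assignment
    mha = tf.keras.layers.MultiHeadAttention(num_heads=2, key_dim=64)

    query = tf.random.normal((1, 10, 512))   # (B, T, E) with E = 512
    value = tf.random.normal((1, 20, 256))   # (B, S, dim); key defaults to value

    output = mha(query=query, value=value)
    print(output.shape)                      # (1, 10, 512): the last dimension is E, the query's last dimension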

Therefore, if you have managed to complete BLOCK 1 of the exercise, you have the first argument for BLOCK 2, “query= Q1”:

        mult_attn_out2, attn_weights_block2 = self.mha2(query=Q1, value=None, key=None, attention_mask=None, return_attention_scores=True)  # (batch_size, target_seq_len, d_model)

Was this helpful?

Sir, thanks for the reply. The assignment I mentioned is Exercise 4, “Encoder Layer”, of the Transformer architecture assignment.

You meant Q1 is nothing Q=W * Q1 (Q1=X) ?

If that's the case, the last dimension of the attention output shape should be the value depth (depth_v) only, as per the lecture, the concepts, and also Exercise 3, scaled dot-product attention.

  output = tf.matmul(attention_weights, v)  # (..., seq_len_q, depth_v)

I don't know, sir, why they mention the query input's last dimension instead of depth_v as the last dimension.
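
A minimal sketch in plain TensorFlow (again not the assignment code; num_heads, key_dim, value_dim and the sizes are arbitrary) may reconcile the two views: each head does produce (..., seq_len_q, depth_v), but MultiHeadAttention then concatenates the heads and projects the result back to the query's last dimension E when output_shape is None:

    import tensorflow as tf

    B, T, S, E = 2, 10, 12, 512   # batch, target length, source length, query last dimension

    # value_dim is the per-head depth_v; it is deliberately different from E here
    mha = tf.keras.layers.MultiHeadAttention(num_heads=8, key_dim=64, value_dim=32)

    q = tf.random.normal((B, T, E))
    v = tf.random.normal((B, S, 256))   # the value features do not have to match E either

    out, scores = mha(query=q, value=v, return_attention_scores=True)
    print(scores.shape)   # (2, 8, 10, 12): per-head weights of shape (B, num_heads, seq_len_q, seq_len_k)
    print(out.shape)      # (2, 10, 512): the heads (8 * 32 values) are combined and projected back to E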

Hi again @Anbu

I am having trouble understanding your writing and it is crucial that I understand it in order to help you.

May I suggest this tool? DeepL Translate: The world's most accurate translator

Specifically, I don’t understand this line:

“You meant Q1 is nothing Q=W * Q1 (Q1=X) ?”

If what you are trying to say is:
“So you mean Q1 ≠ X ?”
Then that is correct: Q1 does not equal X.

In fact, Q1 is what you get when you apply layer normalization (self.layernorm1) to the sum of the attention output (mult_attn_out1) and the input (x).

Hi Sir, sorry for the confusion. I will try to make it clearer now.

Below is the answer you gave. I still do not understand what E means in the attention output shape from the TensorFlow documentation.

This means that, in Block 1 of the assignment in Exercise 6, whatever Q1 outputs, the last dimension of that output will be E.

What do you mean by Q1, sir?

This is what I mean by Q1 in the notebook:

        # apply layer normalization (layernorm1) to the sum of the attention output and the input (~1 line)
        Q1 = **ForYouToFillIn**

        # BLOCK 2
        # calculate self-attention using the Q from the first block and K and V from the encoder output. 
        # Dropout will be applied during training
        # Return attention scores as attn_weights_block2 (~1 line) 
        mult_attn_out2, attn_weights_block2 = self.mha2(query= **ForYouToFillIn**, value= **ForYouToFillIn**, key= **ForYouToFillIn**, attention_mask= **ForYouToFillIn**, return_attention_scores=**ForYouToFillIn**)  # (batch_size, target_seq_len, d_model)
        # returned mult_attn_out2 (attention_output - The result of the computation, of shape (B, T, E)), attn_weights_block2 (attention_scores - [Optional] multi-head attention coefficients over attention axes.)

By definition, Q1 is the sum of the attention output and the input after layer normalisation is applied.
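
As a side note, a minimal sketch in plain TensorFlow (a generic residual + layer-normalization pattern with made-up shapes, not the assignment's solution cell) shows that the sum and LayerNormalization keep the last dimension unchanged, which is why Q1 ends up with the same last dimension E as the block's input:

    import tensorflow as tf

    x = tf.random.normal((2, 10, 512))          # block input, last dimension E = 512
    attn_out = tf.random.normal((2, 10, 512))   # attention output, same shape as x

    layernorm = tf.keras.layers.LayerNormalization(epsilon=1e-6)
    q1 = layernorm(attn_out + x)                # residual sum followed by layer normalization
    print(q1.shape)                             # (2, 10, 512): the last dimension is unchanged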

Thank you sir.

So you meant that for mult_attn_out2 in the decoder, the attention output's last dimension equals the last dimension of Q1's output. Did I understand your answer correctly? But this is for the decoder block.

Actually, I need clarification about the encoder block: where is the last dimension of the attention output shape actually derived from, sir?