attention_output: The result of the computation, of shape (B, T, E), where T is for target sequence shapes and E is the query input last dimension if output_shape is None. Otherwise, the multi-head outputs are projected to the shape specified by output_shape.
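A quick way to see that behaviour for yourself (a minimal sketch, not assignment code; the layer sizes and tensor shapes below are made up for illustration):
import tensorflow as tf
mha = tf.keras.layers.MultiHeadAttention(num_heads=2, key_dim=16)
query = tf.random.normal((1, 5, 64))   # (B, T, E) with E = 64
value = tf.random.normal((1, 7, 32))   # source sequence with a different last dimension
print(mha(query=query, value=value).shape)        # (1, 5, 64): last dim follows the query, since output_shape is None
mha_proj = tf.keras.layers.MultiHeadAttention(num_heads=2, key_dim=16, output_shape=128)
print(mha_proj(query=query, value=value).shape)   # (1, 5, 128): projected to output_shape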
So I think I found the section you are referring to: C5 W4 “Programming Assignment: Transformers Architecture with TensorFlow”, E6 “Decoder Layer”.
Specifically, your issue is on Block 2, where the notebook says:
# BLOCK 2
# calculate self-attention using the Q from the first block and K and V from the encoder output.
# Dropout will be applied during training
# Return attention scores as attn_weights_block2 (~1 line)
mult_attn_out2, attn_weights_block2 = self.mha2(None, None, None, None, return_attention_scores=True) # (batch_size, target_seq_len, d_model)
# apply layer normalization (layernorm2) to the sum of the attention output and the output of the first block (~1 line)
mult_attn_out2 = None # (batch_size, target_seq_len, fully_connected_dim)
Sir, thanks for the reply. The assignment I mentioned is Exercise 4, Encoder Layer, of the Transformer Architecture assignment.
Did you mean that Q1 is nothing but the input, i.e. Q = W * Q1 (where Q1 = X)?
If that is the case, the attention output's last dimension should be the value depth only, as per the lecture, the concepts, and also Exercise 3 (scaled dot-product attention).
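For reference, here is a minimal sketch (not assignment code; shapes are chosen arbitrarily) contrasting plain scaled dot-product attention, whose last dimension does follow the values, with tf.keras.layers.MultiHeadAttention, which applies a final output projection back to the query's last dimension when output_shape is None:
import tensorflow as tf
q = tf.random.normal((1, 5, 8))    # (batch, seq_len_q, depth)
k = tf.random.normal((1, 7, 8))    # (batch, seq_len_k, depth)
v = tf.random.normal((1, 7, 3))    # (batch, seq_len_k, depth_v)
# plain scaled dot-product attention, as in Exercise 3: output last dim = depth_v
scores = tf.matmul(q, k, transpose_b=True) / tf.math.sqrt(tf.cast(tf.shape(k)[-1], tf.float32))
weights = tf.nn.softmax(scores, axis=-1)
print(tf.matmul(weights, v).shape)          # (1, 5, 3): follows the values
# Keras MultiHeadAttention adds a final dense projection, so its output last dim equals the query's last dim
mha = tf.keras.layers.MultiHeadAttention(num_heads=2, key_dim=8, value_dim=3)
print(mha(query=q, value=v, key=k).shape)   # (1, 5, 8): follows the query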
# apply layer normalization (layernorm1) to the sum of the attention output and the input (~1 line)
Q1 = **ForYouToFillIn**
# BLOCK 2
# calculate self-attention using the Q from the first block and K and V from the encoder output.
# Dropout will be applied during training
# Return attention scores as attn_weights_block2 (~1 line)
mult_attn_out2, attn_weights_block2 = self.mha2(query= **ForYouToFillIn**, value= **ForYouToFillIn**, key= **ForYouToFillIn**, attention_mask= **ForYouToFillIn**, return_attention_scores=**ForYouToFillIn**) # (batch_size, target_seq_len, d_model)
# returned mult_attn_out2 (attention_output - The result of the computation, of shape (B, T, E)), attn_weights_block2 (attention_scores - [Optional] multi-head attention coefficients over attention axes.)
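To illustrate those two return values with a standalone example (the layer, variable names, and dimensions below are invented for the demo, not taken from the notebook):
import tensorflow as tf
mha2_demo = tf.keras.layers.MultiHeadAttention(num_heads=4, key_dim=16)
q_demo = tf.random.normal((2, 10, 64))     # (batch_size, target_seq_len, d_model)
enc_demo = tf.random.normal((2, 12, 64))   # (batch_size, input_seq_len, d_model)
out_demo, scores_demo = mha2_demo(query=q_demo, value=enc_demo, key=enc_demo, return_attention_scores=True)
print(out_demo.shape)     # (2, 10, 64): attention_output, shape (B, T, E)
print(scores_demo.shape)  # (2, 4, 10, 12): attention_scores, shape (B, num_heads, T, S)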
By definition, Q1 is the result of applying layer normalisation to the sum of the attention output and the input.
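In other words (just a pseudocode restatement of that sentence, not a checked solution):
# pseudocode: residual connection, then layer normalisation
Q1 = layernorm1(x + attention_output)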
So you mean that for mult_attn_out2 in the decoder, the attention output's last dimension equals the last dimension of Q1. Do I understand your answer correctly? But this is for the decoder block.
Actually, I need clarification about the encoder block: where is the attention output's last dimension actually derived from, sir?