attention_output: The result of the computation, of shape (B, T, E), where T is for target sequence shapes and E is the query input last dimension if output_shape is None. Otherwise, the multi-head outputs are projected to the shape specified by output_shape.
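A quick way to see that behaviour for yourself (a minimal sketch, not assignment code; the layer sizes and tensor shapes below are made up for illustration):
import tensorflow as tf
mha = tf.keras.layers.MultiHeadAttention(num_heads=2, key_dim=16)
query = tf.random.normal((1, 5, 64))   # (B, T, E) with E = 64
value = tf.random.normal((1, 7, 32))   # source sequence with a different last dimension
print(mha(query=query, value=value).shape)        # (1, 5, 64): last dim follows the query, since output_shape is None
mha_proj = tf.keras.layers.MultiHeadAttention(num_heads=2, key_dim=16, output_shape=128)
print(mha_proj(query=query, value=value).shape)   # (1, 5, 128): projected to output_shape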
So I think I found the section you are referring to: C5 W4 “Programming Assignment: Transformers Architecture with TensorFlow”, E6 “Decoder Layer”.
Specifically, your issue is on Block 2, where the notebook says:
# BLOCK 2
# calculate self-attention using the Q from the first block and K and V from the encoder output.
# Dropout will be applied during training
# Return attention scores as attn_weights_block2 (~1 line)
mult_attn_out2, attn_weights_block2 = self.mha2(None, None, None, None, return_attention_scores=True) # (batch_size, target_seq_len, d_model)
# apply layer normalization (layernorm2) to the sum of the attention output and the output of the first block (~1 line)
mult_attn_out2 = None # (batch_size, target_seq_len, fully_connected_dim)
Sir, thanks for the reply. The assignment I mentioned is Exercise 4, Encoder Layer, of the Transformer Architecture assignment.
Did you mean that Q1 is nothing but the input, i.e. Q = W * Q1 (where Q1 = X)?
If that is the case, the attention output's last dimension should be the value depth only, as per the lecture, the concepts, and also Exercise 3 (scaled dot-product attention).
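For reference, here is a minimal sketch (not assignment code; shapes are chosen arbitrarily) contrasting plain scaled dot-product attention, whose last dimension does follow the values, with tf.keras.layers.MultiHeadAttention, which applies a final output projection back to the query's last dimension when output_shape is None:
import tensorflow as tf
q = tf.random.normal((1, 5, 8))    # (batch, seq_len_q, depth)
k = tf.random.normal((1, 7, 8))    # (batch, seq_len_k, depth)
v = tf.random.normal((1, 7, 3))    # (batch, seq_len_k, depth_v)
# plain scaled dot-product attention, as in Exercise 3: output last dim = depth_v
scores = tf.matmul(q, k, transpose_b=True) / tf.math.sqrt(tf.cast(tf.shape(k)[-1], tf.float32))
weights = tf.nn.softmax(scores, axis=-1)
print(tf.matmul(weights, v).shape)          # (1, 5, 3): follows the values
# Keras MultiHeadAttention adds a final dense projection, so its output last dim equals the query's last dim
mha = tf.keras.layers.MultiHeadAttention(num_heads=2, key_dim=8, value_dim=3)
print(mha(query=q, value=v, key=k).shape)   # (1, 5, 8): follows the query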
# apply layer normalization (layernorm1) to the sum of the attention output and the input (~1 line)
Q1 = **ForYouToFillIn**
# BLOCK 2
# calculate self-attention using the Q from the first block and K and V from the encoder output.
# Dropout will be applied during training
# Return attention scores as attn_weights_block2 (~1 line)
mult_attn_out2, attn_weights_block2 = self.mha2(query= **ForYouToFillIn**, value= **ForYouToFillIn**, key= **ForYouToFillIn**, attention_mask= **ForYouToFillIn**, return_attention_scores=**ForYouToFillIn**) # (batch_size, target_seq_len, d_model)
# returned mult_attn_out2 (attention_output - The result of the computation, of shape (B, T, E)), attn_weights_block2 (attention_scores - [Optional] multi-head attention coefficients over attention axes.)
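To illustrate those two return values with a standalone example (the layer, variable names, and dimensions below are invented for the demo, not taken from the notebook):
import tensorflow as tf
mha2_demo = tf.keras.layers.MultiHeadAttention(num_heads=4, key_dim=16)
q_demo = tf.random.normal((2, 10, 64))     # (batch_size, target_seq_len, d_model)
enc_demo = tf.random.normal((2, 12, 64))   # (batch_size, input_seq_len, d_model)
out_demo, scores_demo = mha2_demo(query=q_demo, value=enc_demo, key=enc_demo, return_attention_scores=True)
print(out_demo.shape)     # (2, 10, 64): attention_output, shape (B, T, E)
print(scores_demo.shape)  # (2, 4, 10, 12): attention_scores, shape (B, num_heads, T, S)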
By definition, Q1 is the result of applying layer normalisation to the sum of the attention output and the input.
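In other words (just a pseudocode restatement of that sentence, not a checked solution):
# pseudocode: residual connection, then layer normalisation
Q1 = layernorm1(x + attention_output)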
So you mean that for mult_attn_out2 in the decoder, the attention output's last dimension equals the last dimension of Q1. Do I understand your answer correctly? But this is for the decoder block.
Actually, I need clarification about the encoder block: where is the attention output's last dimension actually derived from, sir?