Conceptual Questions about Transformers

  1. Please see the next point.
  2. The encoder output we want to pass to the decoder is the set of modified embeddings of the encoder terms. Look at class Encoder for more details. Here’s a recap:
    a. Perform dot-product self-attention to measure the similarity between encoder terms (see the attention sketch after this list).
    b. Use this information to update the embeddings before passing them to the decoder.
    c. During training we pass the translated sentence as the decoder input, and the expected output is the same sentence offset by one position, so each position predicts the next token. This is done so that we can calculate the loss for all positions in parallel (see the masking sketch after this list). Please go back to the lecture(s) on look-ahead masking to understand the role it plays in decoder attention.
  3. The attention weights are used in grading. See scaled_dot_product_attention_test for details.
  4. Decoder weights are also used for grading. See Decoder_test and Transformer_test.
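
For reference, here is a minimal sketch of the attention step described in 2a/2b. It assumes a TensorFlow setup; the exact signature and the mask convention (here, 1 means "block this position") may differ from the assignment's scaled_dot_product_attention, so treat it as an illustration, not the graded implementation:

```python
import tensorflow as tf

def scaled_dot_product_attention(q, k, v, mask=None):
    """Return the attention output and the attention weights.

    q, k, v have shape (..., seq_len, depth). mask broadcasts to
    (..., seq_len_q, seq_len_k) and holds 1 where attention is blocked.
    """
    matmul_qk = tf.matmul(q, k, transpose_b=True)   # (..., seq_len_q, seq_len_k)
    dk = tf.cast(tf.shape(k)[-1], tf.float32)
    scaled_logits = matmul_qk / tf.math.sqrt(dk)    # scale by sqrt(d_k)
    if mask is not None:
        scaled_logits += mask * -1e9                # blocked positions -> ~0 after softmax
    attention_weights = tf.nn.softmax(scaled_logits, axis=-1)
    output = tf.matmul(attention_weights, v)        # weighted sum of the values
    return output, attention_weights

# Encoder self-attention: q, k and v are all the same embeddings.
x = tf.random.uniform((1, 5, 8))                    # (batch, seq_len, depth)
out, weights = scaled_dot_product_attention(x, x, x)
```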
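
And a small sketch of the shifted decoder input and look-ahead mask from 2c. The helper name create_look_ahead_mask and the token ids (<start>=1, <end>=2) are assumptions made for illustration and may not match the assignment exactly:

```python
import tensorflow as tf

def create_look_ahead_mask(size):
    """1 above the diagonal: position i may only attend to positions <= i
    (same "1 = block" convention as the attention sketch above)."""
    return 1 - tf.linalg.band_part(tf.ones((size, size)), -1, 0)

# Teacher forcing with hypothetical token ids (<start>=1, <end>=2):
translated    = tf.constant([[1, 7, 9, 4, 2]])  # <start> w1 w2 w3 <end>
decoder_input = translated[:, :-1]              # <start> w1 w2 w3
expected_out  = translated[:, 1:]               # w1 w2 w3 <end>, one step ahead of the input
look_ahead    = create_look_ahead_mask(decoder_input.shape[1])
# The loss is computed at every position at once, each predicting the next token,
# while the look-ahead mask stops a position from attending to later tokens.
```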