Conceptual Questions about Transformers

  1. Please see the next point.
  2. The encoder output we want to pass to the decoder is the set of modified embeddings of the encoder terms. Look at class Encoder for more details. Here’s a recap:
    a. Perform dot-product self-attention to measure the similarity between encoder terms (see the attention sketch after this list).
    b. Use this information to update the embeddings before passing them to the decoder.
    c. During training we pass the translated sentence as the decoder input, and the expected output is the same sentence offset by one position, so each position predicts the next token. This is done so that we can calculate the loss for all positions in parallel (see the masking sketch after this list). Please go back to the lecture(s) on look-ahead masking to understand the role it plays in decoder attention.
  3. The attention weights are used in grading. See scaled_dot_product_attention_test for details.
  4. Decoder weights are also used for grading. See Decoder_test and Transformer_test.
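
For reference, here is a minimal sketch of the attention step described in 2a/2b. It assumes a TensorFlow setup; the exact signature and the mask convention (here, 1 means "block this position") may differ from the assignment's scaled_dot_product_attention, so treat it as an illustration, not the graded implementation:

```python
import tensorflow as tf

def scaled_dot_product_attention(q, k, v, mask=None):
    """Return the attention output and the attention weights.

    q, k, v have shape (..., seq_len, depth). mask broadcasts to
    (..., seq_len_q, seq_len_k) and holds 1 where attention is blocked.
    """
    matmul_qk = tf.matmul(q, k, transpose_b=True)   # (..., seq_len_q, seq_len_k)
    dk = tf.cast(tf.shape(k)[-1], tf.float32)
    scaled_logits = matmul_qk / tf.math.sqrt(dk)    # scale by sqrt(d_k)
    if mask is not None:
        scaled_logits += mask * -1e9                # blocked positions -> ~0 after softmax
    attention_weights = tf.nn.softmax(scaled_logits, axis=-1)
    output = tf.matmul(attention_weights, v)        # weighted sum of the values
    return output, attention_weights

# Encoder self-attention: q, k and v are all the same embeddings.
x = tf.random.uniform((1, 5, 8))                    # (batch, seq_len, depth)
out, weights = scaled_dot_product_attention(x, x, x)
```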
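
And a small sketch of the shifted decoder input and look-ahead mask from 2c. The helper name create_look_ahead_mask and the token ids (<start>=1, <end>=2) are assumptions made for illustration and may not match the assignment exactly:

```python
import tensorflow as tf

def create_look_ahead_mask(size):
    """1 above the diagonal: position i may only attend to positions <= i
    (same "1 = block" convention as the attention sketch above)."""
    return 1 - tf.linalg.band_part(tf.ones((size, size)), -1, 0)

# Teacher forcing with hypothetical token ids (<start>=1, <end>=2):
translated    = tf.constant([[1, 7, 9, 4, 2]])  # <start> w1 w2 w3 <end>
decoder_input = translated[:, :-1]              # <start> w1 w2 w3
expected_out  = translated[:, 1:]               # w1 w2 w3 <end>, one step ahead of the input
look_ahead    = create_look_ahead_mask(decoder_input.shape[1])
# The loss is computed at every position at once, each predicting the next token,
# while the look-ahead mask stops a position from attending to later tokens.
```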