Conceptual questions about encoder / decoder from the "Generating text with transformers" video

  1. What are the details of the token embeddings? When I previously used Word2Vec, the dimensions of the embedding conceptually captured the co-occurrence of words in the text, and the number of dimensions was dictated by the size of the vocabulary. This seems different: the number of dimensions doesn’t seem related to the tokenization. What are the dimensions, and how are the embeddings generated? Can someone explain or provide a pointer? (The first sketch below shows my current mental model.)

  2. What are the dimensions of the Encoder output that is fed into the Decoder? In the example they pass 3 tokens into the encoder; if the output of the encoder is the logits for each token, are the dimensions of the output “num tokens” x “vocabulary size”? Or something else? And how is that matrix used in the Decoder to influence the self-attention weights? (The second sketch below shows the shapes I’d expect.)
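
To make question 1 concrete, here is my rough mental model of the embedding step. This is just a sketch using a PyTorch-style nn.Embedding; the vocab_size and d_model values are placeholders I picked, not numbers from the video.

```python
import torch
import torch.nn as nn

vocab_size = 32000   # size of the tokenizer's vocabulary (placeholder value)
d_model = 512        # embedding dimension, a model hyperparameter (placeholder value)

# A learned lookup table: one d_model-sized vector per token id. The vectors are
# initialized randomly and trained along with the rest of the model, rather than
# being built from co-occurrence statistics the way Word2Vec vectors are.
embedding = nn.Embedding(vocab_size, d_model)

token_ids = torch.tensor([[17, 942, 5]])   # three example token ids (batch of 1)
x = embedding(token_ids)
print(x.shape)                             # torch.Size([1, 3, 512])
```

Is that roughly right, i.e. the embedding dimension is just a hyperparameter that is independent of the vocabulary size?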
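
Similarly for question 2, this is the shape I would expect the encoder output to have and how I imagine the decoder attending over it; again a rough sketch with placeholder sizes, using PyTorch's built-in layers rather than anything from the course.

```python
import torch
import torch.nn as nn

d_model, num_heads = 512, 8   # placeholder sizes

encoder_layer = nn.TransformerEncoderLayer(d_model, num_heads, batch_first=True)
src = torch.randn(1, 3, d_model)     # 3 source tokens, already embedded
memory = encoder_layer(src)          # shape (1, 3, d_model) -- per-token vectors, not logits?

# My guess: in the decoder, the encoder output supplies the keys and values for an
# attention block, while the decoder's own hidden states supply the queries.
cross_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
tgt = torch.randn(1, 2, d_model)     # 2 target-side tokens generated so far
out, attn_weights = cross_attn(query=tgt, key=memory, value=memory)
print(memory.shape, out.shape)       # torch.Size([1, 3, 512]) torch.Size([1, 2, 512])
```

Is this the mechanism by which the encoder output influences the decoder, or does it feed into the decoder’s self-attention weights in some other way?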

Thanks for your help.

You can check the Natural Language Specialization to understand this in more depth, or some other resource.