I’m a bit confused by the explanation of the transformer model. Can someone clarify the following:
- How are the input words x represented? Are they one-hot vectors or learned word embeddings?
- Is there an error in the explanation of the multi-head attention? I thought Q = W^Q X. Why are the inputs to the multi-head attention written as just W^Q? I suspect this should be W^Q X, or just Q with a subscript denoting the head number. The same goes for K and V.
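For reference, here is my current understanding of the per-head projections as a minimal NumPy sketch (following the formulation head_i = Attention(X W_i^Q, X W_i^K, X W_i^V) from "Attention Is All You Need"; all shapes and weight names here are just illustrative, not from the explanation I'm asking about):

```python
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable softmax over the given axis.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_Q, W_K, W_V, W_O):
    # X: (seq_len, d_model); W_Q/W_K/W_V: per-head lists of (d_model, d_k) matrices.
    heads = []
    for W_Qi, W_Ki, W_Vi in zip(W_Q, W_K, W_V):
        Q = X @ W_Qi  # per-head query: Q_i = X W_i^Q — the projection is applied to X
        K = X @ W_Ki  # per-head key:   K_i = X W_i^K
        V = X @ W_Vi  # per-head value: V_i = X W_i^V
        d_k = Q.shape[-1]
        A = softmax(Q @ K.T / np.sqrt(d_k))  # scaled dot-product attention weights
        heads.append(A @ V)
    # Concatenate all heads and apply the output projection.
    return np.concatenate(heads, axis=-1) @ W_O

# Toy example: 4 tokens, d_model=8, 2 heads of size d_k=4.
rng = np.random.default_rng(0)
seq, d_model, h, d_k = 4, 8, 2, 4
X = rng.normal(size=(seq, d_model))
W_Q = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
W_K = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
W_V = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
W_O = rng.normal(size=(h * d_k, d_model))
out = multi_head_attention(X, W_Q, W_K, W_V, W_O)
print(out.shape)  # (4, 8)
```

If this matches what the explanation intended, then writing the heads' inputs as W^Q alone would indeed seem to be shorthand (or a typo) for the projected queries Q_i = X W_i^Q.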