Clarification of definitions in transformer model

I’m a bit confused by the explanation of the transformer model. Can someone clarify the following:

  1. How are the words x represented? Are they one-hot vectors or word embeddings?
  2. Is there an error in the explanation of the multi-head model? I thought Q = W^Q X. Why are the inputs to the multi-head model written as W^Q? I suspect this should be W^Q X, or just Q with a subscript denoting the head number. The same goes for K and V.

I had a look at the Vaswani paper. They describe the W^Q, W^K, W^V inputs to the multi-head attention as per-head projections of Q, K, V. I think I understand now: the multi-head version isn't just a self-attention model with more channels; there's an additional projection layer that maps Q, K, V into a separate subspace per head, adding an extra (head) dimension.
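To check my understanding, here's a minimal numpy sketch of that reading (all shapes and weights are illustrative, and the projection matrices are random rather than learned): each head i has its own W_i^Q, W_i^K, W_i^V that project the input into a smaller per-head subspace, runs scaled dot-product attention there, and the head outputs are concatenated.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)

seq_len, d_model, n_heads = 4, 8, 2
d_k = d_model // n_heads  # per-head dimension

# Stand-in word embeddings for the inputs x (one row per token)
X = rng.normal(size=(seq_len, d_model))

heads = []
for i in range(n_heads):
    # Per-head projection matrices W_i^Q, W_i^K, W_i^V (random here, learned in practice)
    W_Q = rng.normal(size=(d_model, d_k))
    W_K = rng.normal(size=(d_model, d_k))
    W_V = rng.normal(size=(d_model, d_k))
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V       # project into this head's subspace
    scores = softmax(Q @ K.T / np.sqrt(d_k))  # scaled dot-product attention
    heads.append(scores @ V)                  # (seq_len, d_k)

out = np.concatenate(heads, axis=-1)  # concatenate heads back to (seq_len, d_model)
print(out.shape)  # (4, 8)
```

So each head's projection shrinks the working dimension to d_k = d_model / n_heads, and the extra "dimension" is the head index itself.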
I'm guessing the words x can be represented either way (one-hot or embedded), since the projections work regardless.