I think the link covers most of it, but let me elaborate on the key points along with your questions.
I think the word “Query” is misleading in the Transformer, but that is the given definition, so we need to start from there.
As I wrote in this thread, Q is the “target” and K/V is the “source”, and we create a mapping between them. The mapping can be for translation between languages, or for relationships among words within a single language. The term “query” is slightly misleading, since it is not a real query like in SQL.
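To make the target/source picture concrete, here is a minimal NumPy sketch of scaled dot-product attention. The shapes and names are my own toy choices, not the course code:

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: Q is the 'target' side, K/V the 'source' side."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                      # (n_target, n_source)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)                   # softmax over the source words
    return w @ V, w                                      # each target word = weighted mix of source values

rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 8))   # 3 "target" (e.g. decoder) word vectors
K = rng.normal(size=(4, 8))   # 4 "source" (e.g. encoder) word vectors
V = rng.normal(size=(4, 8))
out, w = attention(Q, K, V)   # w[i, j]: how much target word i attends to source word j
```

Nothing in there “asks a question” in the SQL sense; it is just weighted matching between the target side and the source side.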
> Can the whole of W^Q be thought of as just one “query” that we ask about each word?
As you see, Q (or, more precisely, the product QW^Q) consists of multiple words, each a mixture of “word embedding” and “positional encoding”, and is dispatched to multiple attention heads. The important thing is not one “query” for the whole sentence; rather, each word vector, like q^{<1>}W_1^{Q}, is associated with the words in a source list, like k^{<1>}W_1^{K}, k^{<2>}W_1^{K}, k^{<3>}W_1^{K}, ..., and those associations create the relationships among words that are called “attention weights”.
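A per-word view of the same thing, again as a small sketch with my own dimension choices:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_head, n_words = 16, 4, 5
W1_Q = rng.normal(size=(d_model, d_head))   # W_1^Q of head 1; the same matrix for every position
W1_K = rng.normal(size=(d_model, d_head))   # W_1^K
x = rng.normal(size=(n_words, d_model))     # word vectors = embedding + positional encoding

q1 = x[0] @ W1_Q                            # q^{<1>} W_1^Q for the first word
k = x @ W1_K                                # k^{<j>} W_1^K for every word in the source list
scores = k @ q1 / np.sqrt(d_head)           # one score per source word
scores -= scores.max()                      # for numerical stability
weights = np.exp(scores) / np.exp(scores).sum()   # attention weights of word 1 over all words
```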
> W^Q itself does not vary by word or position, so why would we want to ask “what’s happening there?” about “Jane”?
It is not clear to me whether your W^Q means W^{Q} itself or the product QW^{Q}, but in either case we should look at each word, as in the previous question.
Each query entry, i.e., each word vector, includes the “word embedding” and the “positional encoding”. So the “similarity” to other words can be calculated by a dot product or cosine similarity. As you learned with seq-to-seq translation, the word embedding is the key vector for representing a word’s meaning. In addition, we need to consider position information, which is typically lost in an old-style RNN. With both, the Transformer can learn that the first word in Q, “Jane”, can be associated with the second word in K, “visited”. There is no “what’s happening there?” type of query here; that is Andrew’s “intuition” talk.
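For reference, this is the sinusoidal positional encoding from the paper, plus the dot-product similarity on embedding + position; the sentence and sizes are just toy values:

```python
import numpy as np

def positional_encoding(n_pos, d_model):
    """Sinusoidal encoding from 'Attention Is All You Need'."""
    pos = np.arange(n_pos)[:, None]
    i = np.arange(d_model)[None, :]
    angles = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    pe = np.zeros((n_pos, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])   # even dimensions
    pe[:, 1::2] = np.cos(angles[:, 1::2])   # odd dimensions
    return pe

rng = np.random.default_rng(0)
emb = rng.normal(size=(3, 8))            # toy embeddings for "Jane visited Africa"
x = emb + positional_encoding(3, 8)      # what the attention actually compares
sim = x @ x.T / np.sqrt(8)               # sim[0, 1] relates "Jane" to "visited"
```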
> If each W^Q is just a bundle of queries, one per word, then there’s no reason for those queries to have any common meaning across words within heads, right?
QW^{Q} includes multiple words, each consisting of “word embedding” and “positional encoding”. Each head receives a part of q^{<i>}W_h^Q, which may correspond to part of the word embedding or of the positional encoding, but it receives all of the words. So each head can create associations among the words in Q (and K).
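The splitting across heads is just a reshape. A sketch of how each head sees all words but only a slice of each projected vector (sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n_words, d_model, n_heads = 5, 16, 4
QWq = rng.normal(size=(n_words, d_model))    # Q W^Q for the whole sentence
d_head = d_model // n_heads
per_head = QWq.reshape(n_words, n_heads, d_head).transpose(1, 0, 2)
# per_head[h] has shape (5, 4): head h still sees ALL 5 words,
# but only a 4-dimensional slice of each word's projected vector.
```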
> All the rows in W^Q ultimately have to work with the same v^{<1>}, v^{<2>}, … as their answer/value.
If that is self-attention, we create associations among the words in Q and K, but all of them are weighted, and the weights are learnable.
> So maybe in different heads, we’ll get very different value matrices, some of which specialize in answering particular types of queries.
Each head has slightly different attention weights, since, for example, the first head receives part of the word embedding while the second head receives part of the positional encoding. But all of them have a mapping over the whole set of words. You can see examples in the thread “Course 5 Week 4 Assignment: Why are attention weights returned in DecoderLayer”.
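If you want to inspect the per-head weights outside the assignment, Keras’s built-in layer exposes the same idea; this is not the assignment’s own DecoderLayer, just a minimal sketch:

```python
import tensorflow as tf

mha = tf.keras.layers.MultiHeadAttention(num_heads=2, key_dim=4)
x = tf.random.normal((1, 5, 8))                    # batch of 1, 5 words, d_model = 8
out, attn = mha(query=x, value=x, key=x,
                return_attention_scores=True)      # attn shape: (1, 2, 5, 5)
# attn[0, h] is head h's weight matrix over all 5 words; the heads differ,
# but every head covers the whole sentence.
```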
When I took this course a few years ago, there was no Transformer chapter, so my starting point was the paper. I believe you have gone through it, but if not, please take a look. It is a paper, but the content is not like a classic paper; it reads more like a web article. Easy to read.
Then, I need a cup of coffee… 