N_seq vs n_q

In the scheme below, is n_seq the embedding length and n_q the number of tokens (words) in the query? It is confusing, as the description doesn’t distinguish them clearly.
Why, after the random rotations, can one use argmax to sort the input tokens/words?

Hi @PZ2004

No, n_seq is the length of the input sentence (padded/truncated).

No, n_q is the dimensionality of Q (it could be the same as the embedding dimension, or you could choose some other number).
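
If it helps to see the shapes, here is a quick sketch (the sizes are made up):

    import numpy as np

    n_seq = 8  # number of tokens in the (padded/truncated) input sentence
    n_q = 4    # dimensionality chosen for Q

    # one n_q-dimensional query vector per input token
    Q = np.random.normal(size=(n_seq, n_q))
    print(Q.shape)  # (8, 4)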

I’m not sure what you mean by that. If you have Step 4 in mind:

    ### Step 4 ###
    # for each hash and each token, pick the index of the largest
    # rotated component -- that index is the token's bucket id
    buckets = np.argmax(rotated_vecs, axis=-1).astype(np.int32)

Then in this case, argmax is used not to “sort the input” but to determine which bucket the input token lands in after the rotations.

For example:

I know it’s not easy to see, but for each hash we can determine which bucket the input token lands in. In this case (a toy reproduction follows the list):

  • the argmax (or bucket number) for token0, for hash0 is 1 (since 2.67 is the max value)
  • the argmax (or bucket number) for token0, for hash1 is 2 (since 3.03 is the max value)
  • the argmax (or bucket number) for token0, for hash2 is 0 (since 1.83 is the max value)
  • etc.
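
Since the screenshot is hard to read here, this is a toy version of token0’s rows of rotated_vecs consistent with those numbers (everything except the three maxima is made up):

    import numpy as np

    # rows = hashes, columns = buckets; only the three maxima
    # (2.67, 3.03, 1.83) come from the example above, the rest is filler
    rotated_vecs_token0 = np.array([[ 0.41,  2.67, -1.20],   # hash0 -> bucket 1
                                    [-0.55,  0.92,  3.03],   # hash1 -> bucket 2
                                    [ 1.83, -0.37,  0.64]])  # hash2 -> bucket 0

    print(np.argmax(rotated_vecs_token0, axis=-1))  # [1 2 0]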

I hope that makes sense. Cheers

Many thanks for explaining with this example.
It appears to me that the random rotations are used to generate random vectors (n_hash * n_buckets of them in total) with the same dimension as q_d. Each random vector represents a bucket. Each token (entry) in Q then gets assigned to a bucket according to its similarity to the random vectors representing the buckets (via the dot product; the maximum value indicates the highest similarity in the q_d-dimensional space).
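
To check my understanding, here is a rough sketch of what I mean (the variable names, shapes, and einsum layout are my guesses, not necessarily the assignment’s exact code):

    import numpy as np

    n_seq, q_d = 8, 4        # number of tokens, dimensionality of each entry in Q
    n_hash, n_buckets = 3, 3

    Q = np.random.normal(size=(n_seq, q_d))

    # n_hash * n_buckets random "goal-post" vectors, each of dimension q_d
    random_rotations = np.random.normal(size=(q_d, n_hash, n_buckets))

    # dot product of every token with every goal-post:
    # rotated_vecs[h, t, b] = similarity of token t to bucket b under hash h
    rotated_vecs = np.einsum('tq,qhb->htb', Q, random_rotations)

    # each token joins the bucket whose vector it is most aligned with
    buckets = np.argmax(rotated_vecs, axis=-1).astype(np.int32)
    print(buckets.shape)  # (n_hash, n_seq) -> (3, 8)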

This is like clustering similar entries based on their vectors in the q_d-dimensional space, using the random_rotation vectors as goal-posts. The dot product is then only computed between entries that are similar. Since Q*Q.T is essentially a variance/covariance matrix, the values are high only between similar entries and close to 0 between dissimilar ones. So by multiplying only the similar entries, we get close to the full picture of the entire variance/covariance matrix, with the uncalculated pairs set to 0s?
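
If that reading is right, here is how I would sanity-check it numerically (again just my own toy sketch, not the assignment’s code):

    import numpy as np

    n_seq, q_d, n_hash, n_buckets = 8, 4, 3, 3
    Q = np.random.normal(size=(n_seq, q_d))
    rotations = np.random.normal(size=(q_d, n_hash, n_buckets))
    buckets = np.argmax(np.einsum('tq,qhb->htb', Q, rotations), axis=-1)

    full_scores = Q @ Q.T  # (n_seq, n_seq): the "full picture"

    # keep pair (i, j) only if tokens i and j share a bucket under some hash
    same_bucket = (buckets[:, :, None] == buckets[:, None, :]).any(axis=0)

    # pairs that were never compared are treated as 0
    approx_scores = np.where(same_bucket, full_scores, 0.0)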