Self attention and redundancy

Really enjoying learning about transformers. From what I understand in self attention Q,K,V are all the same matrices derived from the word embeddings in input texts. Considering that in multihead attention this same data is reused many times I wonder if it causes issues with data redundancy, such as overfitting? Do people typically transform the value matrix? A follow up question would be how do people typically apply regularization to transformers.

Finally for RNNs and information bottlenecking, is there a way to quantify how much information one loses with sequence length? Would be interested in learning more about how concepts in information theory such as entropy are used to better understand these networks. I acknowledge that this question is outside the scope of the class but just curious.

Thanks!

These are interesting questions, so let me offer my thoughts.

From what I understand in self attention Q,K,V are all the same matrices derived from the word embeddings in input texts.

I'm not sure I fully understand that, so to make sure we are on the same page, let me elaborate.

Just to be on the safe side: Q, K, V are not "all the same". Maybe it's just the wording you used, but these matrices contain different numerical values (different from each other, and different in each head, since each head has its own split of Q, K, V).

Considering that in multihead attention this same data is reused many times I wonder if it causes issues with data redundancy, such as overfitting?

Data quality is always very important, no matter what the architecture of the neural network is. To take an extreme case, if your dataset is the same sentence over and over again, then the NN architecture you use doesn't matter. In NLP that is usually not the case: the dataset contains a lot of complexity, and one way to capture that complexity is to use Multi-Head Attention.

The loose illustration is that each "head" focuses on different aspects of language (like male vs. female, singular vs. plural, etc.).

In reality, each head just "works" with a subset of (Q, K, V). For example, if the embedding dimension is 12 and we have 3 heads, then after multiplying the embedding matrix by Wq, Wk, Wv we have the Q, K, V matrices, and each head gets as its input 4 numbers (not 12) from each matrix (the first 4, the next 4, the last 4). And back to loose words again - each head "specializes" in its own split of 4 numbers. The loss function makes the heads "cooperate" so that the end result is as good as possible (it pushes the heads not to specialize in the same thing - training adjusts the Wq, Wk, Wv parameters so that the Q, K, V matrices have the best numbers for each head).

All that is a long way of saying that natural language data usually has enough complexity, and the multi-head approach usually does not contribute to overfitting (the number of tunable parameters contributes to overfitting, but multi-head vs. single-head is usually not the problem).
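To make the split concrete, here is a minimal sketch in plain Python. The embedding dimension of 12 and the 3 heads come from the example above; the sequence length of 5, the random weights, and the helper names (`rand_matrix`, `matmul`, `split_heads`) are assumptions made purely for illustration.

```python
import random

random.seed(0)

# 12 and 3 come from the example above; seq_len is an assumed value.
seq_len, d_model, n_heads = 5, 12, 3
d_head = d_model // n_heads  # 4 numbers per head

def rand_matrix(rows, cols):
    """Random matrix standing in for embeddings or learned weights."""
    return [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def matmul(a, b):
    """Plain-Python matrix product of a (n x m) and b (m x p)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

E = rand_matrix(seq_len, d_model)  # one embedding row per token
Wq, Wk, Wv = (rand_matrix(d_model, d_model) for _ in range(3))

# Same input E, three different weight matrices, so Q, K, V differ.
Q, K, V = matmul(E, Wq), matmul(E, Wk), matmul(E, Wv)

def split_heads(m):
    """Head h gets columns [h * d_head, (h + 1) * d_head) of every row."""
    return [[row[h * d_head:(h + 1) * d_head] for row in m] for h in range(n_heads)]

Q_heads, K_heads, V_heads = split_heads(Q), split_heads(K), split_heads(V)

# heads, tokens, numbers per token per head
print(len(Q_heads), len(Q_heads[0]), len(Q_heads[0][0]))  # 3 5 4
```

Note that every head still sees every token; what is split is the feature dimension, not the sentence.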

Do people typically transform the value matrix?

I'm not sure what you mean. The value matrix is the transform of the embedding matrix (E @ Wv), and it is used in QKV Attention without further modifications.

Maybe you are asking whether we typically transform the output after concatenating the results of Multi-Head Attention? Then yes - usually there is a weight matrix Wo that multiplies the concatenated head outputs.
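As a rough sketch of where Wo sits, the snippet below runs scaled dot-product attention per head, concatenates the head outputs, and multiplies by Wo. The sizes, the random weights, and the helper names are assumptions for illustration only; real implementations use batched tensor operations and usually add masking.

```python
import math
import random

random.seed(0)
seq_len, d_model, n_heads = 5, 12, 3  # assumed toy sizes
d_head = d_model // n_heads

def rand_matrix(rows, cols):
    return [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, k, v):
    """Scaled dot-product attention for a single head."""
    k_t = [list(col) for col in zip(*k)]  # transpose of k
    scale = math.sqrt(len(q[0]))
    weights = [softmax([s / scale for s in row]) for row in matmul(q, k_t)]
    return matmul(weights, v)

E = rand_matrix(seq_len, d_model)
Wq, Wk, Wv, Wo = (rand_matrix(d_model, d_model) for _ in range(4))
Q, K, V = matmul(E, Wq), matmul(E, Wk), matmul(E, Wv)  # V = E @ Wv, used as-is

head_outputs = []
for h in range(n_heads):
    cols = slice(h * d_head, (h + 1) * d_head)
    q_h = [row[cols] for row in Q]
    k_h = [row[cols] for row in K]
    v_h = [row[cols] for row in V]
    head_outputs.append(attention(q_h, k_h, v_h))

# Concatenate the heads along the feature axis, then apply Wo.
concat = [sum((head[t] for head in head_outputs), []) for t in range(seq_len)]
output = matmul(concat, Wo)
print(len(output), len(output[0]))  # 5 12
```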

A follow up question would be how do people typically apply regularization to transformers.

Typically it's Dropout (tuning the dropout rate); there are other techniques, but they are used less frequently.
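For reference, a minimal sketch of what dropout does to a vector of activations, written as "inverted" dropout (survivors are rescaled during training so nothing changes at inference time). The function name, signature, and the rate of 0.1 are assumptions for illustration; in practice you would use the dropout layer that ships with your framework.

```python
import random

def dropout(values, rate, training=True, rng=random):
    """Zero each value with probability `rate`; rescale survivors by 1 / (1 - rate)."""
    if not training or rate == 0.0:
        return list(values)  # dropout is a no-op at inference time
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in values]

random.seed(0)
activations = [1.0] * 10
dropped = dropout(activations, rate=0.1)
print(dropped)
```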

Finally for RNNs and information bottlenecking, is there a way to quantify how much information one loses with sequence length?

It obviously depends on the RNN (vanilla vs. GRU vs. LSTM) but I think it varies wildly even with the same flavor of RNN depending on:

• the dataset (NLP vs. stock market);
• the tokens (character vs. subword vs. word);
• number of layers;
• the activation functions used;
• and probably many more.

I think the loose intuition is: how many times can you multiply a number and squash it with an activation function and still find it useful? For example, take mm of rain today vs. yesterday vs. the day before, and so on: after which point in time does the repeated multiplying and squashing of each data point (mm) stop giving you predictive power?
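One rough way to put numbers on that intuition is to look at how sensitive the final hidden state is to each past input. The sketch below uses a one-unit vanilla RNN with tanh; the weights `w` and `u` and the rainfall values are made up, and this gradient-based sensitivity is only a proxy for "how much information survives", not an information-theoretic measure such as entropy or mutual information.

```python
import math

w, u = 0.9, 0.1  # hypothetical recurrent and input weights
rain_mm = [3.0, 0.0, 12.0, 5.0, 0.0, 1.0, 8.0, 2.0, 0.0, 4.0]  # made-up data, oldest first

# Forward pass: h_t = tanh(w * h_{t-1} + u * x_t)
h = 0.0
states = []
for x in rain_mm:
    h = math.tanh(w * h + u * x)
    states.append(h)

def sensitivity(k):
    """|d h_last / d x_k| via the chain rule through every later step."""
    grad = u * (1.0 - states[k] ** 2)       # effect of x_k on h_k
    for t in range(k + 1, len(states)):
        grad *= w * (1.0 - states[t] ** 2)  # one multiply-and-squash step
    return abs(grad)

for k in range(len(rain_mm)):
    days_ago = len(rain_mm) - 1 - k
    print(f"{days_ago} days ago: sensitivity {sensitivity(k):.2e}")
```

Each later step multiplies the gradient by a factor below 1 here (because `w` is below 1 and the tanh derivative is at most 1), so older inputs generally matter less. Gated cells such as GRU and LSTM are designed to slow that decay.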

Cheers

Hey @arvyzukai,

I believe what @Jose_James wanted to say was that in Self-Attention, Queries, Keys and Values come from the same sentence. Jose, please note that although they come from the same sentence, W^Q, W^K, W^V are still different from one another, and different for each of the heads as well; hence the values in the matrices are different, as Arvydas pointed out.

P.S. - It was fun and interesting to read your reply, Arvydas

Cheers,
Elemento
