How magical is the Transformer

I’ve completed week two, watched the videos multiple times, and read the “Attention Is All You Need” paper. I feel like I have a reasonable mechanical understanding of what’s going on. I also understand many of the choices made, such as the attention scaling factor and why dot-product attention is used instead of a dense layer.

But I do not actually understand why any of this works. In contrast, I was able to somewhat understand what CNNs are doing because we can inspect what’s being detected at various layers. I could almost come up with a loose description of what the learned function at different points might be. For the Transformer I can’t really make sense of how these pieces fit together to predict text. What is a loose description of the function being learned in the feed-forward part of the encoder? Is it different for each of the 6 encoder layers, as in a CNN, or is it just distilling the same function over and over?

Could someone help me understand whether I’m just not familiar enough with the concepts yet to get beyond a mechanical understanding, or whether models like the Transformer are a little magical? I’m worried I’m in a cooking class, feverishly studying a recipe for bread and stressing out that although I can make a tasty loaf, I don’t understand the chemistry behind it. Studying the bread recipe will probably never make me a chemist, so if that’s my situation I should stop stressing out.

Agree 100%.

With your cooking class analogy, I feel like it’s one of those classes where you make bread during the workshop but never make it at home, because you never really understood what was going on, just followed the teacher’s instructions, and were surprised at the end when something edible came out of the oven.


What helped me better understand how Transformers work is focusing on the function of Q, the query. The query functions as a trained selector of the values of meaning features in word embeddings (compare this to a filter selecting the values of image features in a CNN for object recognition).

During training, the parameters that produce the query are calibrated to assign high scores to the values of meaning features in word embeddings that are relevant to constructing the meaning features of an input word embedding (in self-attention) or an output word embedding (in the encoder multi-head attention passed to the decoder, and in the output self-attention).

The values of the meaning features are selected from the various word embeddings on the basis of a match with the key K (also calibrated during training). If the dot product of query and key is high, the meaning-feature values of the value V related to that key are highly relevant to finding the meaning-feature values of the input word (in self-attention) or the output word (in multi-head attention and output self-attention), as expressed in their resulting embeddings.

So Q is trained to match against K (which is also trained), and the resulting scores then select values from V. Conceptually, these are meaning interdependencies. As there can be several meaning interdependencies at once, it makes sense to use multiple heads.
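This Q/K/V selection can be sketched in a few lines of NumPy. This is a minimal single-head illustration with made-up toy dimensions and random weights, not the full multi-head implementation from the paper:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # how well each query matches each key
    weights = softmax(scores, axis=-1)  # rows sum to 1: relevance of each position
    return weights @ V                  # weighted mix of the value vectors

# Toy example: 4 tokens, embedding dimension 8 (sizes chosen for illustration)
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                              # word embeddings
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
out = scaled_dot_product_attention(X @ W_q, X @ W_k, X @ W_v)
print(out.shape)  # (4, 8): one context-mixed vector per token
```

Each output row is a mixture of all the value vectors, weighted by how strongly that token’s query matched each key; that weighting is what I mean by "selecting values of meaning features".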

The way to put these different meaning interdependencies together is to normalize them and train a feed-forward layer to weigh them, thereby again selecting features at increasing levels of abstraction of meaning at deeper layers in the network. Here is an interesting analysis of feed-forward layers, showing their meaning-feature-selecting function, with some similarity to how CNNs select image features.
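As a rough sketch of putting the heads together: their outputs are concatenated, projected back to the model dimension, and passed through the position-wise feed-forward layer. The dimensions below are toy sizes of my own choosing (the paper uses d_model = 512, d_ff = 2048, 8 heads), and I omit the residual connections and layer normalization:

```python
import numpy as np

d_model, d_ff, n_heads, d_head = 16, 64, 4, 4  # toy sizes for illustration
rng = np.random.default_rng(1)

# Pretend outputs of 4 attention heads for a 5-token sentence
head_outputs = [rng.normal(size=(5, d_head)) for _ in range(n_heads)]
W_o = rng.normal(size=(n_heads * d_head, d_model))

# Concatenate the heads and project back to the model dimension
x = np.concatenate(head_outputs, axis=-1) @ W_o           # shape (5, d_model)

# Position-wise feed-forward: FFN(x) = max(0, x W1 + b1) W2 + b2
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)
ffn_out = np.maximum(0, x @ W1 + b1) @ W2 + b2            # shape (5, d_model)
print(ffn_out.shape)  # (5, 16)
```

The feed-forward layer is applied to each position independently, which fits the picture of it weighing and recombining the interdependencies that the heads found for that word.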



In case others were feeling the way I was above, here are some additional resources that were shared with me and helped me get more comfortable:

However, I think the answer I’ve come to for my original question is that (1) yeah, NLP and advanced ML generally appear to be more about experimentation and promoting what works than the other, more mathematically grounded CS topics I study, but (2) this should be exciting, not uncomfortable. Maybe uncomfortably exciting. :slightly_smiling_face:

If you are like me and feeling a little panic that you’re hitting a wall in understanding why the Transformer is so effective, or why this particular function (e.g. multi-headed self-attention) is the right one to add to a neural network to achieve the best results over all the other possible functions, I think the answer really might be that this is just what has been tried and works. And it works shockingly, unreasonably well. This should be exciting, because it might mean that we, as humans, are still in the early observational phases of this science, and that if you’re not quite as satisfied with the explanations here as you are for, say, a physics class…maybe that dissatisfaction is signaling an opportunity to figure it out and advance the science for everyone. The dissatisfaction could be like a toothache: it’s what lets you know something is amiss and gives you the opportunity to fix it.

Or, maybe I need to study harder. If that’s the case please someone let me know.

Hi Matt_Stults,

Thanks for your reply. Great to have this discussion.

You are absolutely right that experimentation plays a very important role in AI.

But the idea behind attention is not devoid of logic. As Bahdanau, Cho, and Bengio write in the original attention paper:

“Each time the proposed model generates a word in translation, it (soft-)searches for a set of positions in a source sentence where the most relevant information is concentrated. The model then predicts a target word based on the context vectors associated with these source positions and all the previous generated target words.”

The idea behind this is that the meaning of a word depends on its context. Or, expressed more generally: in order to understand the meaning of a word, the whole of the text it forms part of needs to be taken into consideration (and, for even better understanding, the non-textual context as well). This idea is generally accepted in hermeneutic and interpretive approaches to language. See, for instance, the Wikipedia page on the hermeneutic circle.

In the Transformer, this hermeneutic/interpretive logic is implemented through the query-key/value combination: the search for the most relevant information is performed by the query that results from training the model, selection occurs on the basis of a match with the key, and the related value is propagated. The values indicate meaning features that together give an indication of the meaning of a word in its context and of the meaning interdependencies between words. Going forward through the Transformer, more abstract levels of meaning features and interdependencies are constructed, to finally arrive at the highest probability score for a word to be used in the translation (much like feature extraction and probability prediction in a CNN).

How is this captured in mathematics? Well, (scaled) dot-product attention is one part of it, keeping in mind the role of backpropagation/gradient descent in shaping the actual queries, keys, and values in combination with the word embeddings. The paper on the feed-forward layer that I referenced in my previous post provides an interesting mathematical analysis of higher-level feature extraction.
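For reference, this is the scaled dot-product attention formula from the paper; the 1/√d_k factor is the scaling discussed in the course, which keeps the dot products from growing so large that the softmax saturates and its gradients vanish:

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V
```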

Are certain elements tweaked, and did experimentation occur? Of course. But the idea of paying attention to context words in order to understand meaning did not come out of thin air, and the mathematical implementation makes sense in terms of the logic of linguistic interpretation (which is why it works better than preceding NLP models).

It seems to me that what can make the logic of the Transformer difficult to understand is that it requires a philosophical/linguistic understanding in addition to a mathematical one.

I hope this adds an angle to your perspective.