I think I’m missing a fundamental concept in our implementation of the Transformer blocks. There are a few points in the flow where we want to augment our features with additional information: a position representation, or the features from the previous layer. It would make sense to me if we combined the two vectors (word encodings and positions, for example) by concatenating them into a longer vector, thus preserving the meaning of their individual feature dimensions. But that’s not what we do. We just add them together.
Conceptually this seems weird to me. Let’s say dimension 0 in our word encoding corresponds to some concept like “edible”. Dimension 0 in the position vector fluctuates rapidly and acts roughly like the low bit of a binary representation of the position. Adding the two together seems to overlay two different concepts arbitrarily, whereas concatenating them would add the positional dimensions to the encoding as new features without interfering with the existing ones. A similar thing happens when we add the inputs of a layer to its output and then normalize.
Is there a fundamental concept that I’m missing here, or is this just kind of a “hack” to cut down on layer dimensions that ends up working because the training process can learn its way around it?
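To make the two options concrete, here is a toy NumPy sketch (made-up vectors and sizes, just for illustration) of what the “add” and “concat” versions would look like:

```python
# Toy comparison of adding vs. concatenating a positional vector (made-up values).
import numpy as np

word = np.array([0.9, 0.1, -0.3, 0.5])  # pretend dim 0 means "edible"
pos  = np.array([0.0, 1.0,  0.0, 1.0])  # pretend positional vector for this slot

added  = word + pos                      # what Transformers actually do: same width,
                                         # dim 0 now mixes "edible" with position info
concat = np.concatenate([word, pos])     # the alternative: width doubles, features
                                         # stay in separate slots
print(added.shape, concat.shape)         # (4,) (8,)
```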
Let me offer my intuition, which comes from this legendary paper that the image below is taken from:
As you know, the derivative with respect to each input of an addition is just 1, so the gradient in back-prop flows through it unimpeded (like on a highway). So in this picture you can imagine that the two leftmost cases have some bottlenecks (road tolls) and the gradient flows back to the source (the embeddings, for example) with some obstacles, while in the rightmost case the gradient can flow freely back to the source and, as a “side quest”, learn some interesting features/weights.
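To see the “derivative is 1” point in actual numbers, here is a tiny PyTorch check (tanh just stands in for whatever the residual branch computes):

```python
# For y = x + f(x), dy/dx = 1 + f'(x): the identity path always contributes
# a clean gradient of 1, regardless of what f does.
import torch

x = torch.tensor(2.0, requires_grad=True)
f = torch.tanh                 # stand-in for a residual branch
y = x + f(x)
y.backward()
print(x.grad)                  # 1 + (1 - tanh(2)^2) ≈ 1.07; the "1" is the highway
```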
If we kept branching (by concatenating the outputs each time), we would end up with millions of features for a single example, and then how big would the dataset have to be? So, in order to learn, we constrain the dimensions somewhat but give the model enough depth to learn interesting things.
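As a rough back-of-the-envelope calculation (hypothetical sizes, not from any specific model):

```python
# If every one of N blocks concatenated its input and output instead of
# adding them, the feature width would double per block.
d_model, n_blocks = 512, 12
width = d_model
for _ in range(n_blocks):
    width *= 2                 # concat(input, output) doubles the width each time
print(width)                   # 2_097_152 features after 12 blocks, vs. a constant 512 with addition
```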
That definitely does give me a clearer picture of the purpose of feeding the prior input into the output via addition. Basically it gives a way to propagate gradients more directly back through the layers, so a change can more easily have effects all the way down a deep stack.
The training still gets to optimize the output to be useful (for example, it could just subtract out x_L itself if it sees fit), but it now has a channel for backprop to chase a gradient all the way down the stack mostly unimpeded.
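Something like this minimal sketch of a residual block (PyTorch, sizes are just placeholders): the branch F is free to learn to roughly cancel x if that happens to help, while the identity path stays open for the gradient.

```python
# Minimal residual block: output = x + F(x).
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, d_model=512):
        super().__init__()
        self.f = nn.Sequential(          # the "side road": learned transform
            nn.Linear(d_model, d_model),
            nn.ReLU(),
            nn.Linear(d_model, d_model),
        )

    def forward(self, x):
        return x + self.f(x)             # the "highway": identity plus residual

block = ResidualBlock()
y = block(torch.randn(2, 512))           # shape preserved: (2, 512)
```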
Can you please elaborate on this a little more? In the two leftmost cases and the rightmost case, isn’t the only difference the positioning of the layers, namely the BatchNorm and ReLU layers? Are you suggesting that by keeping the main track free of any BN or ReLU layers, the network is able to:
1. Learn more interesting features in the side-track
2. During back-propagation, easily differentiate between the gradients for the two paths, since, with respect to these paths, the derivative of the addition operation is 1.
By the way, I found this thread on Stack Exchange, discussing this exact query. What are your thoughts on this?
I like the comment in the SE thread Elemento found: “So this property shouldn’t add additional non linearity to the token embedding, but instead acts more like a linear transformation, since any change in position changes the the embedding linearly. In my intuition this should also enable easy separation of positional vs token information.” which specifically addresses adding the positional information.
It’s basically justifying it by saying that, due to how the position is encoded, it should be easy for the trained layers to separate the position encoding from the word embedding without adding extra dimensions.
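A quick toy check of that linearity claim (NumPy, made-up vectors): adding the same fixed positional vector to two different token embeddings leaves their difference untouched, so the token information is not destroyed by the shift.

```python
# Adding a fixed positional vector is a linear shift: the relative geometry
# of token embeddings at the same position is preserved.
import numpy as np

cat  = np.array([0.2, -0.5, 0.9, 0.1])
dog  = np.array([0.3, -0.4, 0.8, 0.2])
pe_7 = np.array([0.66, 0.75, 0.07, 1.0])   # some fixed encoding for position 7

print(np.allclose((cat + pe_7) - (dog + pe_7), cat - dog))   # True
```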
No, I’m not saying “more interesting”. I guess what I’m trying to say is more “gentle” features. If we stretch the analogy (I hope not too much), you can imagine yourself travelling by car on a highway while your friend on a motorcycle turns off onto the regional roads; then you meet again and your friend tells you his impressions/his story, which is not the same as the whole experience but still somewhat helpful. But if you (no friend this time) travelled the regional roads yourself, and the journey is very long (a deep NN), then you might get lost eventually and never reach the destination.
Leaving the analogy aside and going back to real numbers: BN and ReLU are modifiers of the numbers, which eventually might be too much to backtrack through (if we have 100 layers).
If I understand you correctly, then yes, I’m suggesting that there is “enough” throughput for the gradient to get back to the roots (the first layers) when we introduce the residual connections.
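If you want to see that “throughput” in numbers, here is a rough sketch (PyTorch, arbitrary depth and width) comparing the gradient norm at the input of a deep plain stack with the same stack using residual connections. Exact numbers vary from run to run, but the residual version reliably keeps a usable gradient while the plain stack tends to shrink it.

```python
# Gradient norm at the input of a 50-layer stack, with and without residuals.
import torch
import torch.nn as nn

torch.manual_seed(0)
depth, d = 50, 64
layers = [nn.Sequential(nn.Linear(d, d), nn.ReLU()) for _ in range(depth)]

def input_grad_norm(residual: bool) -> float:
    x = torch.randn(1, d, requires_grad=True)
    h = x
    for layer in layers:
        h = h + layer(h) if residual else layer(h)
    h.sum().backward()
    return x.grad.norm().item()

print("plain   :", input_grad_norm(residual=False))
print("residual:", input_grad_norm(residual=True))
```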
If we are talking about the positional encoding, that is a somewhat different but related matter (related to the residual connections). Also, this post only talks about the fixed (sinusoid) positional encoding (it could also be learned). And in my experience the sinusoid absolute positional encoding is usually presented on its side (it should be vertical, but maybe it’s just a matter of implementation); what I’m saying is that, in my experience, it should look like this:
The pictures are the same, but it’s strange to see it on its side like that.
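For reference, this is the usual fixed sinusoid from “Attention Is All You Need”, built with positions as rows, i.e. the “vertical” orientation I mean (NumPy, toy sizes):

```python
# Standard sinusoidal positional encoding, shape (seq_len, d_model):
# even columns get sin, odd columns get cos, frequency decreases to the right.
import numpy as np

def sinusoidal_pe(seq_len: int, d_model: int) -> np.ndarray:
    pos = np.arange(seq_len)[:, None]              # (seq_len, 1)
    i   = np.arange(0, d_model, 2)[None, :]        # even dimension indices
    angle = pos / np.power(10000, i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angle)
    pe[:, 1::2] = np.cos(angle)
    return pe                                      # rows = positions

pe = sinusoidal_pe(seq_len=50, d_model=16)
print(pe.shape)   # (50, 16): plotting this puts position on the vertical axis
```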
But if you are asking why we add them (instead of concatenating), then I would answer that we “fuse” them into the embedding dimension. As for intuition, I think we have to leave behind the intuition of the embedding features being fixed (white vs. black for one feature, light vs. heavy for the second, slow vs. fast for the third, and so on) and arrive at an intuition where the feature also depends on the position: something like white_+0, light_+1, slow_+0 for the first word; then for the second word it’s white_+0.84, light_+0.54, slow_+0.53; and so on, up to something like white_-0.54, light_-0.84, slow_-0.61 for the tenth word. In other words, the meaning of being “white” depends on both absolute and relational values (after attention). If the positional encoding is fixed like this, then it “forces” the embedding features to move around. For example, features (like “white”) whose meaning is constant regardless of where they appear would “move” to the rightmost columns (the 25th column, for example, where the sinusoid varies slowly), and features that depend strongly on where they are would move to the leftmost columns. If the positional encoding is learned, then this fuses the meanings more freely and the interpretation is even more complex.
I don’t think I’m doing a good job of explaining this, so if you ask a more specific question I can try to answer.
The above intuition seems pretty good to me. I guess for now I am good with why we perform the “add” operation at various points in a Transformer instead of the “concat” operation. But surely I will pick your brain some other time, hopefully with yet another intriguing concept.
P.S. - Thanks @Gregory_Bush for initiating this query