I cannot understand the following in the CrossAttention part:
You need a way to pass both the output of the attention alongside the shifted-to-the-right translation (since this cross attention happens in the decoder side). For this you will use an Add layer so that the original dimension is preserved, which would not happen if you use something like a Concatenate layer.
It seems this case was not covered in the lecture. Can anyone clarify it?
tf.add, or more correctly tf.keras.layers.Add(), is used in the CrossAttention class. The self.add layer used in the CrossAttention call directs the prediction of the next word to the shifted-right translation when it concatenates the target and the attention output.
In the CrossAttention class (__init__):
self.layernorm = tf.keras.layers.LayerNormalization()
self.add = tf.keras.layers.Add()
In the CrossAttention call method:
x = self.add([target, attn_output])
x = self.layernorm(x)
This part of the code is already provided in the graded cell and you don't have to write it, but for it to work you first need to define the correct multi-head attention layer in the CrossAttention class and compute the correct attn_output in its call method, so that the shifted-right translation is handled properly.
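For orientation only, here is a minimal sketch of how the pieces might fit together, using tf.keras.layers.MultiHeadAttention. The attribute name self.mha, the call signature, and the num_heads/key_dim values are illustrative assumptions and may differ from the actual assignment code:

import tensorflow as tf

class CrossAttention(tf.keras.layers.Layer):
    def __init__(self, num_heads=2, key_dim=64, **kwargs):  # illustrative hyperparameters
        super().__init__(**kwargs)
        self.mha = tf.keras.layers.MultiHeadAttention(num_heads=num_heads, key_dim=key_dim)
        self.layernorm = tf.keras.layers.LayerNormalization()
        self.add = tf.keras.layers.Add()

    def call(self, target, context):
        # target: shifted-right translation embeddings (decoder side)
        # context: encoder output
        attn_output = self.mha(query=target, key=context, value=context)
        x = self.add([target, attn_output])  # element-wise residual addition, shape preserved
        x = self.layernorm(x)
        return x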
As you can see, there are "Add & Norm" blocks all over the place. The Transformer architecture is heavily dependent on the residual-network idea: each layer adds on top of the current embeddings. You can learn more about this idea, which was introduced by He et al. (2016), but to keep it short, the main idea is that Multi-Head Attention "adds" something (whatever it "thinks" needs to be added) to the current embeddings, and the residual connection also helps gradients flow (so that we can stack many layers).
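If it helps to see the "Add & Norm" pattern in isolation, here is a small runnable illustration; the tensor shapes are arbitrary, chosen just for the demo:

import tensorflow as tf

add = tf.keras.layers.Add()
norm = tf.keras.layers.LayerNormalization()

x = tf.random.normal((2, 5, 16))             # current embeddings: (batch, seq_len, d_model)
sublayer_out = tf.random.normal((2, 5, 16))  # e.g. output of attention or the feed-forward block

y = norm(add([x, sublayer_out]))             # "Add & Norm": residual addition, then normalization
print(y.shape)                               # (2, 5, 16), same shape as x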
I’m not sure I expressed myself well. To clarify, what I meant is that tf.add() performs element-wise addition, not concatenation. For example:
a = tf.constant([1, 2, 3])
b = tf.constant([4, 5, 6])
tf.add(a, b)               # -> [5 7 9]  (element-wise addition, same shape)
tf.concat([a, b], axis=0)  # -> [1 2 3 4 5 6]  (concatenation, shape grows)
In other words, I'm sure you already know this, but I wanted to make sure that learners do not mix up these two operations, since the word "concatenate" was used, especially in the context of the original question above.
Also, the original “hint”:
so that the original dimension is preserved, which would not happen if you use something like a Concatenate layer
does not convey the idea well: it's not just about making the dimensions "fit", it is about how the embeddings are modified, via the "residual connection" (or, in plain words, "addition").
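To see the same point at the embedding level (shapes chosen only for illustration): adding keeps the model dimension and modifies the embeddings in place, while concatenating would change the dimension entirely:

import tensorflow as tf

target = tf.random.normal((2, 5, 16))        # (batch, seq_len, d_model)
attn_output = tf.random.normal((2, 5, 16))

added = tf.keras.layers.Add()([target, attn_output])
concatenated = tf.keras.layers.Concatenate(axis=-1)([target, attn_output])

print(added.shape)         # (2, 5, 16): embeddings modified element-wise (residual connection)
print(concatenated.shape)  # (2, 5, 32): last dimension doubled, no longer matches d_model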