W1 homework: confused about tf.keras.layers.Add

Hey,

I cannot understand the following part of the CrossAttention section:

You need a way to pass both the output of the attention alongside the shifted-to-the-right translation (since this cross attention happens in the decoder side). For this you will use an Add layer so that the original dimension is preserved, which would not happen if you use something like a Concatenate layer.

It seems this case was not covered in the lectures. Can anyone clarify it?

Thank you

Hi Chris,

tf.add, or more precisely tf.keras.layers.Add(), is used in the CrossAttention class. The self.add layer defined there and used in the CrossAttention call method directs the prediction of the next word toward the right-shifted translation when it concatenates the target and the attention outputs.

In the CrossAttention class (constructor):

self.layernorm = tf.keras.layers.LayerNormalization()
self.add = tf.keras.layers.Add()


In the CrossAttention call method:

x = self.add([target, attn_output])
x = self.layernorm(x)

This section of the code is already provided in the graded cell, so you don't have to write it. But for it to work, you first need to define the correct multi-head attention layer in the CrossAttention class and compute the correct attention output in its call method, so that the right-shifted translation flows through properly.
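For intuition only, here is a minimal, generic sketch of the pattern those two lines implement. This is not the assignment's exact code; the layer name, head count, and dimensions below are made up for illustration:

import tensorflow as tf

class CrossAttentionSketch(tf.keras.layers.Layer):
    # Generic "attention -> Add -> LayerNorm" block, for illustration only.
    def __init__(self, d_model=16, num_heads=2):
        super().__init__()
        self.mha = tf.keras.layers.MultiHeadAttention(num_heads=num_heads, key_dim=d_model)
        self.add = tf.keras.layers.Add()
        self.layernorm = tf.keras.layers.LayerNormalization()

    def call(self, target, context):
        # target: right-shifted translation embeddings, shape (batch, target_len, d_model)
        # context: encoder output, shape (batch, context_len, d_model)
        attn_output = self.mha(query=target, key=context, value=context)
        x = self.add([target, attn_output])  # residual connection, shape is preserved
        return self.layernorm(x)

The Add layer simply sums the two tensors element-wise, so the output keeps the same (batch, target_len, d_model) shape as the target.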

Hope this clears your doubt.

Regards
DP


My question is not about that. I’m wondering why we need the Add() function. Can you provide more details or an intuitive example? Thank you.

To concatenate the target and the attention output.

Hi @Chris.X

You should take a close look at the Attention Is All You Need paper, in particular the model architecture diagram (Figure 1).

In that diagram, "Add & Norm" blocks appear all over the place. The Transformer architecture relies heavily on the Residual Networks idea: each sub-layer adds its output on top of the current embeddings. You can learn more about this idea, introduced by He et al. (2016), but to keep it short: the main point is that Multi-Head Attention "adds" something (whatever it "thinks" needs to be added) to the current embeddings, and the residual connection also helps the gradients flow (so that we can stack many layers).
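To make the residual idea concrete, here is a tiny sketch with made-up shapes, using a Dense layer as a stand-in for the attention sub-layer:

import tensorflow as tf

# Hypothetical embeddings: batch of 2 sentences, 5 tokens each, d_model = 16
x = tf.random.normal((2, 5, 16))

# Any sub-layer works here; Dense stands in for Multi-Head Attention
sublayer = tf.keras.layers.Dense(16)

# The "Add & Norm" block from the diagram: residual connection + LayerNormalization
out = tf.keras.layers.LayerNormalization()(x + sublayer(x))

print(out.shape)  # (2, 5, 16), same shape as the input embeddings

The sub-layer only contributes a "delta" that gets added on top of x, while the identity path gives gradients a direct route back through many stacked layers.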

Cheers


Hi @Deepti_Prasad

Technically speaking, it is not concatenation (which is "joining" or "stacking" tensors together); it is the element-wise summation of the values.

Cheers


Welcome back arvy :slightly_smiling_face:

Hope you are doing well now!!

Yes, by concatenation I meant joining.

Thanks @Deepti_Prasad :slight_smile:

I’m not sure I expressed myself well. :slight_smile: To clarify, what I meant is that tf.add() performs element-wise addition, not concatenation. For example:

import tensorflow as tf

a = tf.constant([1, 2, 3])
b = tf.constant([4, 5, 6])

# Element-wise addition:
print(tf.add(a, b))               # [5 7 9]

# Concatenation:
print(tf.concat([a, b], axis=0))  # [1 2 3 4 5 6]

In other words, I'm sure you know this, but I wanted to make sure learners do not mix up these two when you used the word "concatenate", especially in the context of the original question.

Also, the original “hint”:

so that the original dimension is preserved, which would not happen if you use something like a Concatenate layer

does not convey the idea well: it's not just about making the dimensions "fit", it's about how the embeddings are modified, namely through a "residual connection" (or, in plain words, "addition").
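A quick way to see both points (shape preservation and the residual "modification") with made-up shapes:

import tensorflow as tf

# Hypothetical decoder embeddings and attention output, both (batch, seq_len, d_model)
target = tf.random.normal((2, 5, 16))
attn_output = tf.random.normal((2, 5, 16))

added = tf.keras.layers.Add()([target, attn_output])
concatenated = tf.keras.layers.Concatenate()([target, attn_output])

print(added.shape)         # (2, 5, 16): dimensions preserved, embeddings get "nudged"
print(concatenated.shape)  # (2, 5, 32): last dimension doubled, not what the decoder expects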

Cheers
