I cannot understand the following in the CrossAttention part:
You need a way to pass both the output of the attention alongside the shifted-to-the-right translation (since this cross attention happens in the decoder side). For this you will use an Add layer so that the original dimension is preserved, which would not happen if you use something like a Concatenate layer.
It seems this case was not covered in the lecture. Can anyone clarify it?
tf.add, or more correctly tf.keras.layers.Add(), is used in the CrossAttention class. The self.add layer used in the CrossAttention call directs the prediction of the next word to the shifted-right translation when it concatenates the target and the attention output.
In the CrossAttention class (__init__):
self.layernorm = tf.keras.layers.LayerNormalization()
self.add = tf.keras.layers.Add()
In the CrossAttention call method:
x = self.add([target, attn_output])
x = self.layernorm(x)
This part of the code is already provided in the graded cell and you don't have to write it, but for it to work you first need to define the correct multi-head attention layer in the CrossAttention class and compute the correct attn_output in its call method, so that the shifted-right translation is handled properly.
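For orientation only, here is a minimal sketch of how the pieces might fit together, using tf.keras.layers.MultiHeadAttention. The attribute name self.mha, the call signature, and the num_heads/key_dim values are illustrative assumptions and may differ from the actual assignment code:

import tensorflow as tf

class CrossAttention(tf.keras.layers.Layer):
    def __init__(self, num_heads=2, key_dim=64, **kwargs):  # illustrative hyperparameters
        super().__init__(**kwargs)
        self.mha = tf.keras.layers.MultiHeadAttention(num_heads=num_heads, key_dim=key_dim)
        self.layernorm = tf.keras.layers.LayerNormalization()
        self.add = tf.keras.layers.Add()

    def call(self, target, context):
        # target: shifted-right translation embeddings (decoder side)
        # context: encoder output
        attn_output = self.mha(query=target, key=context, value=context)
        x = self.add([target, attn_output])  # element-wise residual addition, shape preserved
        x = self.layernorm(x)
        return x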
As you can see, there are "Add & Norm" blocks all over the place. The Transformer architecture is heavily dependent on the residual-network idea: each layer adds on top of the current embeddings. You can learn more about this idea, which was introduced by He et al. (2016), but to keep it short, the main idea is that Multi-Head Attention "adds" something (whatever it "thinks" needs to be added) to the current embeddings, and the residual connection also helps gradients flow (so that we can stack many layers).
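If it helps to see the "Add & Norm" pattern in isolation, here is a small runnable illustration; the tensor shapes are arbitrary, chosen just for the demo:

import tensorflow as tf

add = tf.keras.layers.Add()
norm = tf.keras.layers.LayerNormalization()

x = tf.random.normal((2, 5, 16))             # current embeddings: (batch, seq_len, d_model)
sublayer_out = tf.random.normal((2, 5, 16))  # e.g. output of attention or the feed-forward block

y = norm(add([x, sublayer_out]))             # "Add & Norm": residual addition, then normalization
print(y.shape)                               # (2, 5, 16), same shape as x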
I’m not sure I expressed myself well. To clarify, what I meant is that tf.add() performs element-wise addition, not concatenation. For example:
a = tf.constant([1, 2, 3])
b = tf.constant([4, 5, 6])
tf.add(a, b)               # -> [5 7 9]  (element-wise addition, same shape)
tf.concat([a, b], axis=0)  # -> [1 2 3 4 5 6]  (concatenation, shape grows)
In other words, I'm sure you already know this, but I wanted to make sure that learners do not mix up these two operations, since the word "concatenate" was used, especially in the context of the original question above.
Also, the original “hint”:
so that the original dimension is preserved, which would not happen if you use something like a Concatenate layer
does not convey the idea well: it's not just about making the dimensions "fit", it is about how the embeddings are modified, via the "residual connection" (or, in plain words, "addition").
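To see the same point at the embedding level (shapes chosen only for illustration): adding keeps the model dimension and modifies the embeddings in place, while concatenating would change the dimension entirely:

import tensorflow as tf

target = tf.random.normal((2, 5, 16))        # (batch, seq_len, d_model)
attn_output = tf.random.normal((2, 5, 16))

added = tf.keras.layers.Add()([target, attn_output])
concatenated = tf.keras.layers.Concatenate(axis=-1)([target, attn_output])

print(added.shape)         # (2, 5, 16): embeddings modified element-wise (residual connection)
print(concatenated.shape)  # (2, 5, 32): last dimension doubled, no longer matches d_model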