Questions regarding Course 4, Week 1

Q-1: Can you explain the AttentionQKV() function shown in the assignment?


Suppose the input to AttentionQKV is as below:
Query = (64, 32, 1024)
Key = (64, 32, 1024)
Value = (64, 32, 1024)
Mask = (64, 1, 32, 32)

which produces output as:
Activation: (64, 32, 1024)
mask: (64, 32, 1024)

Q-1(A): In the lab assignment, we calculated scaled dot-product attention based on the query, key, and value. But in this part of the implementation,
cb.Parallel(
core.Dense(d_feature),
core.Dense(d_feature),
core.Dense(d_feature)
)

the query, key, and value are first converted into another form. Why? Shouldn't we use the original encoder and decoder outputs?

Q-1(B): After the PureAttention() output, core.Dense(d_feature) is applied, which converts the output into yet another form. Why? Shouldn't we use the original output of the scaled dot-product attention implemented by PureAttention()?

Q-1(C): What are the n_heads and dropout parameters in PureAttention()? I did not come across these parameters in the scaled dot-product assignment.

Q-2: In the Basic Attention Operation lab assignment, I learned about alignment scores, which are used to build the context hidden state. But in this week's assignment, I didn't see any method that uses this concept. Why?

Q-3: What is the meaning of (activation + query), which is implemented by the Residual layer? Won't it change the scaled dot-product output produced by PureAttention()?

Q-4: Does trax.layers.LSTM(n_units=d_feature) output the cell state or the hidden state when we give it an input of dimension d_feature?

Hi Aayush_Jariwala,

Here’s my two cents.

The inputs are fed through a number of Dense layers, similar to the way this is done in the multi-head attention layers discussed in the later weeks of the course on transformers. So this seems to be a step toward the transformer architecture, which is more effective than simple dot-product attention: the Dense layers let the model learn its own projections of the query, key, and value rather than attending on the raw encoder and decoder outputs directly.
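
For intuition, here is a minimal NumPy sketch (not the actual Trax code) of what that Parallel block of Dense layers does: each of the query, key, and value gets its own learned linear projection before the scaled dot-product attention is computed. The weight matrices w_q, w_k, w_v and the simplified mask shape are illustrative assumptions, not names from the assignment.

import numpy as np

def dense(x, w):
    # Stand-in for core.Dense(d_feature): a learned linear projection.
    return x @ w

def scaled_dot_product_attention(q, k, v, mask):
    d = q.shape[-1]
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(d)         # (batch, len_q, len_k)
    scores = np.where(mask, scores, -1e9)                    # hide masked positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ v

batch, length, d_feature = 64, 32, 1024
rng = np.random.default_rng(0)
query = rng.normal(size=(batch, length, d_feature))
key   = rng.normal(size=(batch, length, d_feature))
value = rng.normal(size=(batch, length, d_feature))
mask  = np.ones((batch, 1, length), dtype=bool)  # simplified; the assignment's mask also has a heads axis

# Learned projections, analogous to cb.Parallel(Dense, Dense, Dense).
w_q = rng.normal(size=(d_feature, d_feature))
w_k = rng.normal(size=(d_feature, d_feature))
w_v = rng.normal(size=(d_feature, d_feature))

activation = scaled_dot_product_attention(dense(query, w_q),
                                          dense(key, w_k),
                                          dense(value, w_v),
                                          mask)
print(activation.shape)  # (64, 32, 1024)

Without those projections, the model could only compare the raw encoder/decoder outputs; the learned projections let it decide what to compare (query/key) and what to return (value).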

The parameters n_heads and dropout also indicate that certain steps toward the transformer architecture are included in this assignment. You will learn about these parameters in the coming weeks.
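
To give a rough idea of what n_heads does (again only a sketch, not the Trax implementation): the feature dimension is split into n_heads smaller chunks, attention is computed on each chunk independently, and the chunks are concatenated again; dropout is applied to the attention weights during training. The split_heads helper below is hypothetical.

import numpy as np

def split_heads(x, n_heads):
    # (batch, length, d_feature) -> (batch, n_heads, length, d_feature // n_heads)
    batch, length, d_feature = x.shape
    d_head = d_feature // n_heads
    return x.reshape(batch, length, n_heads, d_head).transpose(0, 2, 1, 3)

x = np.zeros((64, 32, 1024))
print(split_heads(x, n_heads=8).shape)  # (64, 8, 32, 128)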

With regard to your second question, the idea of alignment is implicit in the assignment. For example, section 2.1 contains the passage “the model might have learned that it should align to the second encoder hidden state and subsequently assigns a high probability to the word “geht”.” The lab on basic attention provides an explicit introduction to the concept, whereas the assignment mostly handles it implicitly: the alignment scores are the attention weights computed inside the scaled dot-product attention.
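
To make that link explicit: the alignment scores from the basic-attention lab are exactly the softmax weights inside the scaled dot-product attention, and the context vector is their weighted sum over the values. A toy sketch with made-up variable names:

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

decoder_state  = np.random.randn(1, 1024)    # one query position
encoder_states = np.random.randn(32, 1024)   # 32 key/value positions

alignment = softmax(decoder_state @ encoder_states.T / np.sqrt(1024))  # (1, 32) alignment scores
context   = alignment @ encoder_states                                 # (1, 1024) context vector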

As to your third question, residual layers aim to avoid vanishing gradients; that is why the query is added to the activation. The sum does differ from the raw scaled dot-product output, but the layers after the residual connection are trained on that sum, so this is by design. For an explanation, you can e.g. have a look at this post.
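
Concretely, the Residual layer just adds the two tensors element-wise, so their shapes must match. A tiny sketch, using the shapes from your example:

import numpy as np

query      = np.random.randn(64, 32, 1024)
activation = np.random.randn(64, 32, 1024)   # output of the attention block
output     = activation + query              # what the Residual layer produces
print(output.shape)                          # (64, 32, 1024)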

You can find an answer to your fourth question here.