Residual Layer in Assignment

The line below is from the NMTAttn function:
# Step 5: run the AttentionQKV layer
# nest it inside a Residual layer to add to the pre-attention decoder activations(i.e. queries)
tl.Residual(tl.AttentionQKV(d_model, n_heads=n_attention_heads, dropout=attention_dropout, mode=mode)),

I think before this line is called, we have
[q, k, v, mask, target_token]

After AttentionQKV is called, we have
[activation, mask, target_token]

Then, when Residual is called, how can it pop 'queries' from the stack to combine with the activation?


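If you look at the source code here (search for Residual), you will see that Residual wraps its layers in a Branch whose first entry is the shortcut, which defaults to None. A None entry in a Branch behaves like a layer that takes one argument and leaves it unchanged. Branch gives each of its sub-layers its own copy of the inputs before AttentionQKV runs, so a copy of the queries is already sitting on the stack next to the attention output. A final Add layer then sums the two, so Residual does not have to pop the queries after calling AttentionQKV.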
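
To make the stack bookkeeping concrete, here is a toy model in plain Python. It is not Trax code: branch, add, identity, toy_attention and residual are hypothetical helpers written only for this illustration, and the "attention" is a placeholder computation. It assumes Residual behaves like a Branch of (shortcut, layer) followed by an Add, as described above.

```python
# Toy stack model of Residual; NOT Trax code, all names are hypothetical.

def branch(stack, sublayers):
    """Give each (fn, n_in) sub-layer a copy of the top inputs, then
    concatenate their outputs in order and put them back on the stack."""
    n_in = max(n for _, n in sublayers)
    inputs, rest = stack[:n_in], stack[n_in:]
    outputs = []
    for fn, n in sublayers:
        outputs.extend(fn(*inputs[:n]))
    return outputs + rest

def add(stack):
    """Replace the top two stack items with their sum."""
    a, b, *rest = stack
    return [a + b] + rest

def identity(x):
    """Stands in for the None shortcut: one input, returned unchanged."""
    return (x,)

def toy_attention(q, k, v, mask):
    """Placeholder for AttentionQKV: consumes 4 inputs, returns 2."""
    activation = k * v  # arbitrary stand-in computation
    return (activation, mask)

def residual(stack, layer, n_in):
    """Branch(shortcut, layer) followed by Add."""
    return add(branch(stack, [(identity, 1), (layer, n_in)]))

stack = [1.0, 2.0, 3.0, "mask", "target_token"]  # [q, k, v, mask, target_token]
after_branch = branch(stack, [(identity, 1), (toy_attention, 4)])
result = residual(stack, toy_attention, 4)
print(after_branch)  # [q, activation, mask, target_token]
print(result)        # [q + activation, mask, target_token]
```

In this sketch the queries are never popped after the attention step: the copy made by the branch step is what the add step consumes.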