Q-1: Can you explain the AttentionQKV() function shown in the assignment?
Suppose the input to AttentionQKV is as below:
Query = (64, 32, 1024)
Key = (64, 32, 1024)
Value = (64, 32, 1024)
Mask = (64, 1, 32, 32)
which produces the output:
Activation: (64, 32, 1024)
Mask: (64, 32, 1024)
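(For context, here is a minimal sketch I would use to reproduce those shapes. It assumes the Trax API as used in the course, with n_heads and dropout left at their defaults; it is not code from the assignment.)

import numpy as np
from trax import layers as tl
from trax import shapes

batch, seq_len, d_feature = 64, 32, 1024
query = np.zeros((batch, seq_len, d_feature), dtype=np.float32)
key = np.zeros((batch, seq_len, d_feature), dtype=np.float32)
value = np.zeros((batch, seq_len, d_feature), dtype=np.float32)
mask = np.ones((batch, 1, seq_len, seq_len), dtype=np.bool_)

attn = tl.AttentionQKV(d_feature)                        # n_heads=1, dropout=0.0 by default
attn.init(shapes.signature((query, key, value, mask)))   # initialize the Dense weights
activations, out_mask = attn((query, key, value, mask))
print(activations.shape)  # expected: (64, 32, 1024)
print(out_mask.shape)     # second output, reported as "mask" above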
Q-1(A): In the lab assignment we calculated scaled dot-product attention directly from the query, key, and value. But the implementation above contains
cb.Parallel(
    core.Dense(d_feature),
    core.Dense(d_feature),
    core.Dense(d_feature)
)
which first converts the query, key, and value into another form. Why? Shouldn't we use the original encoder and decoder outputs? (A sketch of the full composition follows Q-1(C) below.)
Q-1(B): After the PureAttention() output, a core.Dense(d_feature) layer converts the output into another form. Why? Shouldn't we use the original output of the scaled dot-product attention implemented by PureAttention()?
Q-1(C): What are the n_heads and dropout parameters in PureAttention()? There were no such parameters in the scaled dot-product assignment I am familiar with.
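(To make Q-1(A), Q-1(B), and Q-1(C) concrete, this is how I understand the pieces fit together, assembled from the snippets above. The function name my_attention_qkv is my own; this is a sketch of the structure being asked about, not code copied from the assignment.)

from trax.layers import combinators as cb
from trax.layers import core
from trax.layers.attention import PureAttention

def my_attention_qkv(d_feature, n_heads=1, dropout=0.0, mode='train'):
    return cb.Serial(
        cb.Parallel(                  # Q-1(A): one Dense projection per input
            core.Dense(d_feature),    #   applied to the query
            core.Dense(d_feature),    #   applied to the key
            core.Dense(d_feature),    #   applied to the value
        ),
        PureAttention(                # Q-1(C): n_heads and dropout go here
            n_heads=n_heads, dropout=dropout, mode=mode),
        core.Dense(d_feature),        # Q-1(B): the Dense after PureAttention
    )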
Q-2: In the Basic Attention Operation lab assignment I learned about alignment scores, which are used to build the context hidden state. But in this week's assignment I didn't see any method that uses this concept. Why?
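(For reference on Q-1 and Q-2, these are the two formulations as I understand them, written from standard definitions rather than copied from the assignments. Scaled dot-product attention, as used in PureAttention:)

\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V

(and the alignment-score / context-vector form from the basic attention lab, in one common notation, where h_j are the encoder hidden states and s_{i-1} is the previous decoder state:)

e_{ij} = \mathrm{score}(s_{i-1}, h_j), \qquad
\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k} \exp(e_{ik})}, \qquad
c_i = \sum_{j} \alpha_{ij}\, h_j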
Q-3: What is the meaning of (activation + query), which is implemented by the Residual layer? Won't it change the scaled dot-product output computed by PureAttention()?
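(Regarding Q-3, here is a tiny sketch of what I understand tl.Residual to compute, on toy shapes; it assumes the Trax API and is not from the assignment.)

import numpy as np
from trax import layers as tl
from trax import shapes

block = tl.Residual(tl.Dense(4))    # as I understand it: output = Dense(x) + x
x = np.ones((2, 4), dtype=np.float32)
block.init(shapes.signature(x))
y = block(x)                        # same shape as x: the input is added back to the sublayer output
print(y.shape)                      # (2, 4)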
Q-4: Does trax.layers.LSTM(n_units=d_feature) output the cell state or the hidden state when we give it an input of dimension d_feature?
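(Regarding Q-4, a quick shape check I would run, assuming the Trax API; the dimensions here are small placeholders.)

import numpy as np
from trax import layers as tl
from trax import shapes

d_feature = 8
lstm = tl.LSTM(n_units=d_feature)
x = np.zeros((2, 5, d_feature), dtype=np.float32)  # (batch, seq_len, d_feature)
lstm.init(shapes.signature(x))
y = lstm(x)
print(y.shape)   # I expect (2, 5, d_feature): one output vector per time step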