In the modified architecture, why are the inputs and outputs used again? The outputs seem to be there for the sake of the second-level decoder, but I couldn't figure out why the input is used again when preparing for attention. Is it used as K in one place and as V in another?
Yes, that's it. The inputs are used again in the preparation for attention because they act as the keys (K) and values (V) in the encoder-decoder attention, allowing the model to determine which parts of the input should be focused on for each generated output token.
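To make that concrete, here is a minimal sketch of encoder-decoder (cross) attention in NumPy, where the queries come from the decoder states and the keys and values come from the encoder outputs. The shapes, variable names, and projection matrices are illustrative assumptions, not the exact implementation from the lesson:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(decoder_states, encoder_outputs, Wq, Wk, Wv):
    # Q comes from the decoder; K and V come from the encoder outputs,
    # i.e. the "input used again" asked about above.
    Q = decoder_states @ Wq          # (tgt_len, d_k)
    K = encoder_outputs @ Wk         # (src_len, d_k)
    V = encoder_outputs @ Wv         # (src_len, d_v)
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # (tgt_len, src_len): target-to-source relevance
    weights = softmax(scores, axis=-1)        # each target token's focus over the source tokens
    return weights @ V               # context vectors passed on in the decoder layer

# Toy example: 3 source tokens, 2 target tokens, model dimension 4 (made-up sizes)
rng = np.random.default_rng(0)
enc_out = rng.normal(size=(3, 4))
dec_states = rng.normal(size=(2, 4))
Wq, Wk, Wv = (rng.normal(size=(4, 4)) for _ in range(3))
context = cross_attention(dec_states, enc_out, Wq, Wk, Wv)
print(context.shape)  # (2, 4): one context vector per target position
```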
Q, K, and V are matrices created in the encoder and decoder attention layers from the input that is fed in; they are projected into vectors that score each token against the other tokens in the sequence (the tokens other than the one currently being processed).
What gives the attention mechanism its added significance is that it focuses on the input fed from the encoder and masks the other inputs/tokens so that attention stays on the target. So when the input is fed to the decoder, the decoder refers back to this attention mechanism, which is focused on the provided target, giving a better translation (see the masking sketch below).
Such techniques are especially helpful for long sequences or long sentences, where the attention mechanism helps the decoder focus on the parts of the sequence it is targeting.
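As a rough illustration of the masking idea mentioned above, here is a small sketch of a causal mask applied before the softmax, as used in the decoder's self-attention so that each position can only attend to earlier target tokens. This is an assumption for illustration, not code from the course:

```python
import numpy as np

tgt_len = 4
rng = np.random.default_rng(1)
scores = rng.normal(size=(tgt_len, tgt_len))   # raw self-attention scores among target tokens

# Mask out "future" positions (upper triangle) with -inf so that, after the
# softmax, their attention weight becomes exactly 0.
mask = np.triu(np.ones((tgt_len, tgt_len), dtype=bool), k=1)
scores = np.where(mask, -np.inf, scores)

weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
print(np.round(weights, 2))   # each row sums to 1; entries above the diagonal are 0
```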
Feel free to ask if you have any doubts.
Regards
DP