Based on my understanding, pre-attention takes target tokens as inputs, but during inference we will not have target tokens. How does pre-attention work at that time? TIA
Hi, @vishwas1!
I’m not specialized in the NLP course, but generally attention mechanisms work with tokenized inputs to produce a sequence of output vectors, one vector for each token (word). In the process, they use the matrices P_q, P_k and P_v to generate the query, key and value vectors to perform the attention calculation as described in (Vaswani et al.).
At inference time, the input tokens are generated the same way, but the target tokens are no longer needed since we are not training and we don’t backpropagate the loss function. We just make a forward pass to get the output vector sequence.
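To make that concrete, here is a minimal NumPy sketch of the QKV attention computation (my own illustration, not the course’s Trax code; `W_q`, `W_k`, `W_v` stand in for the learned projection matrices, whatever the course calls them):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def qkv_attention(x_q, x_kv, W_q, W_k, W_v):
    """Scaled dot-product attention (Vaswani et al., 2017).

    x_q:  (L_q, d_model)  tokens doing the attending (e.g. decoder states)
    x_kv: (L_kv, d_model) tokens being attended to (e.g. encoder states)
    W_q, W_k, W_v: learned projection matrices of shape (d_model, d_k)
    """
    Q = x_q @ W_q                        # queries
    K = x_kv @ W_k                       # keys
    V = x_kv @ W_v                       # values
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # similarity of each query to each key
    weights = softmax(scores, axis=-1)   # attention distribution over keys
    return weights @ V                   # one output vector per query token
```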
Thanks @alvaroramajo for the quick answer.
Does that mean that during inference we don’t use pre-attention?
Pre-attention is used only during training, for teacher forcing and the weight-sharing issue. So during inference we use the decoder hidden state for attention.
Exactly. In attention mechanisms, the already-trained parameters are used at inference time.
Hi @alvaroramajo , @vishwas1 ,
I don’t believe that is accurate, i.e. pre-attention is indeed used even during inference, and you can tell it is so from the code used for inference. It uses the same model and passes the input and ‘current output’ tokens into the model. In fact, there is quite a lot of wasted computation when recursively feeding the next symbol into the model: each time that happens, the input tokens are also fed forward through the encoder again, with the same resulting encoder activations.
I would have pasted the code here, but it would include part of the solution to the assignment; if the moderators are OK with it, I can edit this post to add it. You can look at the ‘Decoding’ section of the NMT assignment (NLP Course 4, Week 1).
Please let me know if I’ve missed anything. Thanks.
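In the meantime, here is a deliberately generic sketch of a greedy decoding loop (not the assignment’s code; the `model` call signature and the start/end token ids are hypothetical stand-ins), which shows where that repeated encoder work comes from:

```python
import numpy as np

def greedy_decode(model, input_tokens, eos_id=1, max_len=64):
    """Generic greedy decoding sketch (hypothetical `model` interface).

    At every step the *same* trained model is called with the input
    tokens plus the output prefix generated so far, so pre-attention
    (and the encoder) run again on every step -- the wasted
    recomputation mentioned above.
    """
    output_tokens = [0]                  # assumed start-of-sequence token id
    for _ in range(max_len):
        # Forward pass: the encoder re-processes input_tokens each iteration.
        log_probs = model((input_tokens, np.array(output_tokens)))
        next_token = int(np.argmax(log_probs[-1]))  # most likely next symbol
        output_tokens.append(next_token)
        if next_token == eos_id:
            break
    return output_tokens
```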
The AttentionQKV layer receives a ‘mode’ parameter that can take the values ‘train’, ‘eval’, or ‘predict’. So it is used; however, the actual difference in operation between the modes is still not clear.
Can anyone please clarify the original question and what the differences are between operation modes?
From the lectures that compared QKV attention to basic attention, I understood that QKV attention only involves matrix multiplication of pre-aligned matrices, whereas basic attention uses a feedforward network to learn those alignments.
So from that I gather that AttentionQKV is not learnable, and therefore that would not be a difference between its operation modes. Is this correct?
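To show what I mean, here is a rough sketch (my own illustration, not the actual Trax internals): the core dot-product attention is pure matrix math with no weights of its own, and I would expect a `mode` flag to change only auxiliary behaviour such as dropout, not the attention formula itself. The learned parameters would live in the projection layers that produce Q, K and V.

```python
import numpy as np

def dot_product_attention(Q, K, V, mode='train', dropout=0.1, rng=None):
    """Core QKV attention: nothing here is learned, it is pure matrix math.

    The `mode` flag (mirroring the 'train'/'eval'/'predict' idea) only
    toggles auxiliary behaviour such as dropout in this sketch; the
    attention computation itself is identical in every mode.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    if mode == 'train' and dropout > 0.0:
        rng = rng or np.random.default_rng(0)
        keep = rng.random(weights.shape) > dropout
        weights = weights * keep / (1.0 - dropout)   # inverted dropout
    return weights @ V
```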