What does the AttentionQKV layer do when not in 'train' mode

The AttentionQKV layer requires Query, Key and Value matrices, all of which are present during training. But during testing, only the Query matrix will be available. Is it just passed through when the net is not being trained?

Hi Steven1,

The input to AttentionQKV comes from PrepareAttentionInput. In turn, PrepareAttentionInput receives encoder activations and decoder activations as inputs. The encoder activations derive from the input tokens, which are present during training, testing, and prediction. So during testing and prediction as well, there will be keys to be matched against queries in order to select values.
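
To make this concrete, here is a rough sketch of what PrepareAttentionInput does with its inputs (simplified, not the assignment's exact code; the shapes and the padding id of 0 are assumptions). The key point is that the keys and values come from the encoder activations, while the queries come from the decoder activations:

```python
import numpy as np

def prepare_attention_input(encoder_activations, decoder_activations, inputs):
    """Sketch: build (queries, keys, values, mask) for AttentionQKV.

    encoder_activations: (batch, input_len, d_model)  -- from the input encoder
    decoder_activations: (batch, target_len, d_model) -- from the pre-attention decoder
    inputs: (batch, input_len) token ids, assuming 0 marks padding
    """
    keys = encoder_activations     # keys come from the encoder (source side)
    values = encoder_activations   # values come from the encoder (source side)
    queries = decoder_activations  # queries come from the decoder (target side)

    # Mask out padding positions in the input so attention ignores them.
    mask = (inputs != 0)[:, None, None, :]  # shape (batch, 1, 1, input_len)
    return queries, keys, values, mask
```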

Hi @reinoudbosch,

Thanks for responding, but I’m still a little confused.

PrepareAttentionInput receives both encoder & decoder activations as inputs. Although the encoder activations derive from input tokens, the decoder activations depend upon target tokens, which are not present during evaluation or testing.

Without target tokens, what does the pre-attention decoder feed to the prepare input layer?

I thought the query matrix was composed of embedded vectors with some extra positional information from the source (i.e. input) language. I thought the key and value matrices depended only upon the target language. Am I wrong about this? (Upon reviewing, I think I am. The encoder, which I associate with the source language, produces the K & V matrices. The decoder, which I associate with the target language, produces the Q matrix. I had thought it was the other way around.)

Hi Steven1,

eval_task uses eval_batch_stream, which is based on eval_stream_fn, which in turn produces the same type of data as train_stream_fn. Cell 3 of the notebook provides an example of eval data, which includes the target tokens that are evaluated against; i.e., target tokens are present during evaluation and testing.
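
For example, you can peek at an eval batch to see that it does contain target tokens (illustrative sketch; the exact variable names come from the notebook):

```python
# Sketch, assuming eval_batch_stream is the batch generator built in the notebook.
batch = next(eval_batch_stream)
input_tokens, target_tokens = batch[0], batch[1]
print(input_tokens.shape)   # tokenized source sentences
print(target_tokens.shape)  # target tokens are present in the eval data as well
```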

trax.supervised.training.Loop uses these targets to determine the accuracy of the predictions during evaluation. You can find the source code of this class here. The evaluation is done in the run_evals method, which starts at line 708 in the source code.
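
In terms of the training API, it is the EvalTask that carries these labeled eval batches into the Loop. Roughly (a sketch assuming the usual setup, with train_task, eval_batch_stream and output_dir defined in the notebook):

```python
from trax import layers as tl
from trax.supervised import training

# Sketch of the usual wiring; model, train_task, eval_batch_stream and
# output_dir are assumed to be defined elsewhere in the notebook.
eval_task = training.EvalTask(
    labeled_data=eval_batch_stream,                  # includes the target tokens
    metrics=[tl.CrossEntropyLoss(), tl.Accuracy()],  # computed against those targets
)

loop = training.Loop(
    model,          # the NMTAttn model
    train_task,
    eval_tasks=[eval_task],
    output_dir=output_dir,
)
# During evaluation, run_evals does a forward pass and computes the metrics;
# no gradients are computed and no parameters are updated.
```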

Evaluation and testing thus boil down to the same forward pass through the model as occurs during training, but in contrast to training, during evaluation and testing there is no backward pass during which the parameters are adjusted.

Hi @reinoudbosch,

I’m sorry to be taking up so much of your time, but maybe this example will pinpoint the source of my confusion:

Imagine we have trained a machine to translate English to German. I want to know how to say “The door is green” in German.

I would start by tokenizing the English sentence. These tokens would be the input tokens. They get passed to the input encoder, which produces the encoder activations.

At the same time, target tokens are passed to the pre-attention decoder, which produces decoder activations.

The encoder and decoder activations are passed to the prepare-attention-input function along with another copy of the input tokens, which then produces the queries, keys, values and mask.

However, the target tokens are a tokenized version of the translation into German of “The door is green”.

If I already have target tokens, what do I need this net for? I should only need to detokenize. If I don’t have target tokens, what is the pre-attention decoder using to produce the decoder activations?

Again, thanks for taking so much time for something I seem to have an odd mental block on.

Hi Steven1,

Now you are asking about prediction, which is different from evaluation and testing in that no targets are used.

Normally, to get the translation started, a start-of-sentence token would be fed to the decoder, which would produce the first query.

In this assignment, as I understand it, def sampling_decode starts with an empty list as output, which is passed to next_symbol, where it is expanded to an array of zeros of shape (1, padded_length). This is then passed to NMTAttn as padded_with_batch, which goes to the pre_attention_decoder. So, for the first word, the query does not contribute anything, and the translation has to come from the LSTM and Dense layers fed by the residual connection that bypasses the attention layer. Once a first output token has been produced and appended to the list, the query becomes different from a vector of all zeros and the attention layer starts to contribute to the translation. Somewhat confusing indeed.
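
In (simplified) code, the prediction loop looks something like this. The names next_symbol and sampling_decode follow the assignment, but this is a sketch from memory: it uses greedy argmax instead of temperature sampling, and it assumes the model returns log probabilities and that token id 1 is end-of-sentence:

```python
import numpy as np

EOS = 1  # assumed end-of-sentence token id

def next_symbol(model, input_tokens, cur_output_tokens):
    """Sketch of next_symbol: predict the next token given the output so far."""
    token_length = len(cur_output_tokens)
    # Pad the current output list to a power of two, as the assignment does.
    padded_length = 2 ** int(np.ceil(np.log2(token_length + 1)))
    padded = cur_output_tokens + [0] * (padded_length - token_length)
    padded_with_batch = np.array(padded)[None, :]  # add a batch dimension

    # Forward pass: on the first call padded_with_batch is all zeros, so the
    # decoder activations (and hence the queries) carry no real signal yet.
    log_probs_batch, _ = model((input_tokens, padded_with_batch))
    log_probs = log_probs_batch[0, token_length, :]  # distribution for the next position
    return int(np.argmax(log_probs))                 # greedy choice for simplicity

def sampling_decode(input_tokens, model):
    """Keep predicting symbols, feeding each one back in, until EOS is produced."""
    cur_output_tokens = []
    cur_output = None
    while cur_output != EOS:
        cur_output = next_symbol(model, input_tokens, cur_output_tokens)
        cur_output_tokens.append(cur_output)
    return cur_output_tokens
```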

Hi reinoudbosch,

You’re right, I meant prediction all along. I’m generally much easier to understand when I use correct terminology. I’m going to have to think about your ultimate answer for a bit.

Thanks again!