Inference for NMT

Not sure if I missed it but the course did not seem to explain how inference is done for NMT. In particular, what are the inputs for the encoder and decoder respectively?

Hi @Peixi_Zhu

I’m not sure that the course didn’t explain it, but the inputs at inference are:

  • for the encoder: the text in the language you are translating from;
  • for the decoder: the text you (the model) have translated so far.

Both inputs are tokenized and padded.
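For a concrete toy picture of what those two inputs can look like at the first inference step, here is a minimal sketch (the token ids, special-token ids and lengths are all made up; this is not the assignment’s code):

```python
# Toy illustration only - token ids, special-token ids and lengths are invented.
PAD, SOS, EOS = 0, 1, 2
MAX_LEN = 8

# Encoder input: the full source sentence, tokenized and padded.
encoder_input = [37, 54, 81, EOS] + [PAD] * (MAX_LEN - 4)   # e.g. "how are you" + <EOS> + padding

# Decoder input at the very first inference step: nothing is translated yet,
# so it is just a start token followed by padding (in some implementations it
# is simply an all-padding, "empty" tensor).
decoder_input = [SOS] + [PAD] * (MAX_LEN - 1)
```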


Thanks, @arvyzukai

Say, for translation of English to German: during training, English tokens are fed into the encoder as keys and values, while German tokens are fed as queries. The decoder also consumes German tokens.

During inference, however, the German sentence is not available, so how does the transformer handle it?

No. The encoder only consumes English tokens. In the Assignment, the encoder (and the decoder, for that matter) is implemented as an LSTM and does not have queries, keys and values (as in the original “Attention Is All You Need” paper). (Btw, you posted your question under the wrong Week, if we are talking about translation.)
But the same is true in the original paper - the encoder only consumes English tokens (and its queries, keys and values are made from the English embeddings + positional encodings).

No. The decoder consumes English and German tokens. Loosely speaking, it tries to complete the sentence in German while “peeking” at the English sentence.

Well, at inference, when the model tries to generate the first word, the decoder’s input is an empty (padded) tensor. It tries to translate by looking at the encoder output (the English sentence representation) and the current translation (which is empty).

The model outputs the whole German sentence (the whole empty tensor is populated with predictions), but we usually pick only the first token in this case.

Then the next step for translation:

  • the encoder outputs are the same (since the English text hasn’t changed)
  • but the decoder input is now not an empty tensor, but a tensor which contains the first predicted word (and padding)
  • now again the model tries to predict the whole translation, but this time by looking at what is already predicted (one word) and what is in the encoder output.

This time we pick the second token in the translation and append it to the first token.

Then the step after that for translation:

  • the encoder outputs are again the same (since the English text hasn’t changed)
  • but the decoder input is now the tensor which contains the first and second predicted words (with padding)
  • now again the model tries to predict the whole translation, again by looking at what is already predicted (two words) and what is in the encoder output.

We pick the third token and append it to the first two. And so on…
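If it helps, here is a rough Python sketch of that loop. This is not the assignment’s code: the model is a random stand-in, and the token ids, lengths and the start-of-sentence convention are assumptions for illustration only.

```python
import numpy as np

PAD, SOS, EOS, MAX_LEN, VOCAB = 0, 1, 2, 10, 100
rng = np.random.default_rng(0)

def model(encoder_input, decoder_input):
    # Stand-in for the real encoder-decoder forward pass: it returns one
    # score per vocabulary word for every position of the output sequence.
    return rng.random((MAX_LEN, VOCAB))

encoder_input = np.array([37, 54, 81, EOS] + [PAD] * (MAX_LEN - 4))  # English sentence, never changes
decoder_input = np.array([SOS] + [PAD] * (MAX_LEN - 1))              # "empty" translation so far

translation = []
for i in range(1, MAX_LEN):
    outputs = model(encoder_input, decoder_input)   # the model predicts the *whole* sequence
    next_token = int(outputs[i - 1].argmax())       # ...but we keep only the position we need
    if next_token == EOS:
        break
    translation.append(next_token)
    decoder_input[i] = next_token                   # feed the prediction back in for the next step
print(translation)
```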

Cheers


Hi,
I was actually asking about the transformer model in week 2. The lecturer used English-to-German translation as an example. I should not have called it NMT, since NMT is the attention-with-LSTM model in week 1. Sorry for the confusion.

In the transformer model, as I mentioned above, the encoder consumes English tokens and German tokens, and the decoder consumes German tokens. During inference, the German text is not available, so what do the encoder and decoder consume other than the English text?

:slight_smile: Well, in Week 2 we implement the Transformer Summarizer and there are no German tokens (unless you call the summary the German tokens).

But what I explained before would apply to this week too (edit: if we had used the original Transformer architecture, like in the translation week):

  • the encoder consumes only the text to be summarized (the “long” English text);
  • the decoder consumes both the text to be summarized (the “long” English text) and the summary up to this point (the “short” English text).

But, if we are talking strictly about the Course 4 Week 2 Assignment, then in that week we use a different architecture - only the decoder (the generator). The input for training is in the form:

  • [Article] → <EOS> → <pad> → [Article Summary] → <EOS> → (possibly) multiple <pad>

In other words, the model is only punished/rewarded on the [Article Summary] section.
So the input to generate the first word (at the first inference step) is:

  • [Article] → <EOS> → <pad>

And the model tries its best to continue the sequence. (This is the right side of the original Transformer architecture.)
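To make that layout concrete, here is a toy sketch (the token ids are invented and the special-token ids are not the assignment’s; it only illustrates the shape of the input):

```python
# Toy sketch of the single decoder-only input described above.
PAD, EOS = 0, 1

article_tokens = [45, 982, 13, 7, 301]   # [Article]
summary_tokens = [45, 301]               # [Article Summary] (available only during training)

# Training example: [Article] -> <EOS> -> <pad> -> [Article Summary] -> <EOS> -> <pad>...
train_input = article_tokens + [EOS, PAD] + summary_tokens + [EOS] + [PAD] * 3

# The loss only covers the summary part (0 = ignore, 1 = punish/reward):
loss_weights = [0] * (len(article_tokens) + 2) + [1] * (len(summary_tokens) + 1) + [0] * 3

# Inference, first step: just [Article] -> <EOS> -> <pad>; the model then keeps
# continuing this sequence, one generated token at a time.
infer_input = article_tokens + [EOS, PAD]
```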


For the encoder, I was specifically referring to the slide here which came from dot-product attention in week 2. My understanding is that this is a translation task and the German text is consumed by the encoder as a query, correct?

During inference, since a complete translated German text is not available, what is used as the query? Is it the German text that has been translated up to this point? If so, then my understanding is that inference still needs to be done sequentially, word by word, right?

Hi @Peixi_Zhu

I’m not a big fan of graphics like this because they are context-specific and their interpretation is loose. But this is almost 100% not an encoder slide. Can you pinpoint where exactly in week 2 (or any other week) this slide is? (I don’t remember every slide in the course and I could not find it quickly.)

Not sure if I should answer this point without getting the first one right. But to be brief, in translation (English → German):

  1. in the encoder - (1) Self-attention - Q, K, V are created from the same input (hence “Self”) - only English.

  2. in the decoder - (2) Causal attention (each token attends only to itself and prior tokens) - Q, K, V are created from the same input - only German.

  3. in the decoder - (3) Cross-attention (attends to the other language’s tokens) - Q from German (2), K and V from English (1).

Yes, during inference (and training), at Cross-attention (3) the Q is created from (2), which was fed the outputs generated so far.

For a better translation (during inference) - yes. (But the outputs of the decoder are the whole sequence; we just pick the one at the index that we need.)
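If it helps to see where Q, K and V come from in those three blocks, here is a tiny runnable sketch. It is not the course’s code: the names and shapes are made up, and the learned W_Q/W_K/W_V projections (and multi-head splitting) are omitted.

```python
import numpy as np

def attention(Q, K, V, causal=False):
    # Plain scaled dot-product attention: softmax(Q K^T / sqrt(d)) V.
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    if causal:  # each position may only attend to itself and prior positions
        scores = np.where(np.tril(np.ones_like(scores)) == 1, scores, -1e9)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

english_repr = np.random.rand(6, 8)   # 6 English tokens, d_model = 8
german_repr  = np.random.rand(5, 8)   # 5 German tokens (translated so far), d_model = 8

# (1) encoder self-attention: Q, K, V all from English
encoder_output = attention(english_repr, english_repr, english_repr)

# (2) decoder causal attention: Q, K, V all from German, future positions masked
german_causal = attention(german_repr, german_repr, german_repr, causal=True)

# (3) decoder cross-attention: Q from German (2), K and V from English (1)
cross = attention(german_causal, encoder_output, encoder_output)
print(cross.shape)   # (5, 8) - one output per German position
```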

Cheers

P.S. this should be discussed under Week 1, since Week 2 uses a decoder-only architecture.


I came from Coursera. Somehow I am no longer able to find that slide since I recently switched from audit mode to paid mode. Looks like the videos have become slightly different.

However, I found this slide at time 1:59 of the video Scaled and Dot-Product Attention in week 2. This slide basically says the same thing, which is German text for Q and English text for K and V.

Hi @Peixi_Zhu

You didn’t mention me in your reply and I didn’t notice that you responded.

Ok, the slide explains Dot-Product Attention (which is “one piece of the puzzle” in this week, but its context (translation) is not exactly the same as this week’s context (summarization)).

So, can you formulate the question you want to ask?

Hi, @arvyzukai

My question is: for the translation task which uses the dot-product attention above, how does the transformer model do inference (e.g., English → German), since the target-language text is not available? Given what we have discussed so far, I think I already have the answer, but please let me know if my understanding below is wrong.

During inference the encoder uses the German translated up to this point to compute Q. So inference has to be done sequentially. At the very first step (when no German word has been generated), Q is computed from the start-of-sentence token only.

Hi @Peixi_Zhu

In our case (translation from English to German), the German sentence is being generated by the decoder.

Well, the model outputs probabilities for the whole sequence, but for translation (during inference) we usually use “our own code” to pick one word at a time (that makes the translation sequential, for better consistency).

“start-of-sentence token only” and padding. (I think you understand this, but just wanted to make sure - the input for the decoder is always the same length).
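Just to picture that last point (the ids here are invented): the decoder input keeps the same length at every step, and only more of the padding gets filled in.

```python
PAD, SOS = 0, 1
step_1 = [SOS, PAD, PAD, PAD, PAD, PAD]   # nothing generated yet
step_2 = [SOS,  57, PAD, PAD, PAD, PAD]   # first predicted token filled in
step_3 = [SOS,  57, 904, PAD, PAD, PAD]   # second predicted token filled in
assert len(step_1) == len(step_2) == len(step_3)
```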

Cheers