How does next word prediction work for language translation?

Do we need two different LLMs for language translation versus generative tasks? If we use the same general LLM for next-word prediction on “I love machine”, how does the system know whether to translate the sentence into a different language (e.g., French) or to predict the next word in English from the current context (e.g., predicting “learning”, which completes the sentence as “I love machine learning”)?

I had this question while watching the video on the Transformer architecture, at the point where the initial word is fed to the input of the decoder.

This is an interesting question @QihangHuang.

I would like to split the answer into two parts:

Answer 1: Seq-to-Seq models
These models use both parts of a transformer: an encoder and a decoder. The encoder takes the source sentence and produces a vector representation that captures its meaning (in practice, one contextual vector per input token). This representation is passed to the decoder, which uses it together with the patterns learned in training to predict the output one word at a time. After the first word, each subsequent prediction takes as input the encoder's representation plus all the words predicted so far. This is how translation works in Seq-to-Seq models.
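To make the loop concrete, here is a minimal sketch in Python. It is not a real model: `TOY_TRANSLATIONS`, `encode`, and `decode_step` are hypothetical stand-ins I made up for illustration, and the lookup table replaces everything a trained network would actually compute. The point is only the shape of the loop: encode once, then predict one word at a time from the encoder output plus the words generated so far.

```python
# Hypothetical lookup table standing in for a trained encoder-decoder model.
TOY_TRANSLATIONS = {
    ("I", "love", "machine", "learning"): ["J'adore", "l'apprentissage", "automatique"],
}


def encode(source_tokens):
    # A real encoder returns contextual vectors; this toy version just
    # returns the tokens as a tuple so they can be used as a lookup key.
    return tuple(source_tokens)


def decode_step(encoded, generated):
    # A real decoder scores every vocabulary word given the encoder output
    # and the words generated so far. Here we simply read the next word
    # from the toy table, or signal the end of the sentence.
    target = TOY_TRANSLATIONS[encoded]
    if len(generated) < len(target):
        return target[len(generated)]
    return "<eos>"


def translate(source_tokens, max_len=10):
    encoded = encode(source_tokens)  # the encoder runs once
    generated = []
    for _ in range(max_len):  # the decoder runs once per output word
        next_word = decode_step(encoded, generated)
        if next_word == "<eos>":
            break
        generated.append(next_word)
    return generated


print(translate(["I", "love", "machine", "learning"]))
```

Notice that `decode_step` always receives both the encoder output and the partial translation, which mirrors the description above.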

Answer 2: Decoder-only models (like GPT - ChatGPT)
These work differently. If you just say “I love machines”, the model will continue the text in English. To have it translate, you have to put the instruction in the prompt, for example “Please translate to French the following sentence: I love machines”. The instruction becomes part of the context, so the most likely continuation is the French translation. In this case, the decoder is purely predicting the next word, then the next, and so on. This works, first, thanks to the huge amount of data the model was trained on, which lets it pick up patterns and become very good at predicting the next word; and second, and most interesting, for reasons that are still not fully understood.
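Here is a small Python sketch of that idea, using the example sentence from the question. Again, this is a toy and not how GPT is implemented: `TOY_NEXT_WORD` is a hypothetical table I wrote by hand to stand in for the patterns a real model learns from data. What it illustrates is that there is a single next-word function over the whole text so far, and only the prompt decides whether the continuation is English or French.

```python
INSTRUCTION = "Please translate to French the following sentence:"
TRANSLATE_PROMPT = INSTRUCTION + " I love machine learning"

# Hypothetical table: maps the full text so far to the next word.
# A real decoder-only model computes this from learned weights.
TOY_NEXT_WORD = {
    "I love machine": "learning",
    TRANSLATE_PROMPT: "J'adore",
    TRANSLATE_PROMPT + " J'adore": "l'apprentissage",
    TRANSLATE_PROMPT + " J'adore l'apprentissage": "automatique",
}


def next_word(text):
    # One function for every task: it only sees the text so far.
    return TOY_NEXT_WORD.get(text, "<eos>")


def generate(prompt, max_len=10):
    text = prompt
    for _ in range(max_len):
        word = next_word(text)
        if word == "<eos>":
            break
        text = text + " " + word  # the prediction is appended to the context
    return text[len(prompt):].strip()  # return only the newly generated part


print(generate("I love machine"))
print(generate(TRANSLATE_PROMPT))
```

The same `generate` function handles both cases; there is no separate translation mode, only a different starting context.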

Please share thoughts, more questions, or comments!

Fascinating topic.