Sequence-to-sequence vs. autoregressive models

In the video 'Pre-training large language models', three types of LLMs are introduced: autoencoder, autoregressive, and sequence-to-sequence (seq2seq). It mentions that seq2seq models are best for translation tasks and autoregressive models are best for text generation.
But GPT-3/ChatGPT, as autoregressive models, perform well on many language tasks other than text generation, such as translation and sentiment analysis. Does the categorization in the video still make sense?
It seems to me that seq2seq models are no longer necessary in applications (except for training and language modelling), as autoregressive models can already cover their areas of expertise.
Please help clarify whether my assumption is correct. Thank you!

Hi @Claire_Gong ,

Thank you for your insight!

Autoregressive models are really the surprise of the moment in many regards. As the lecture states, these models not only generate text; the trainer also notes that new applications are still being discovered for them, and this remains an active field of research.

As you rightly point out, translation and sentiment analysis, among others, are some of these additional abilities.

Still, I would not yet support the idea of discarding encoder-decoder models for seq2seq tasks. Although decoder-only models are very good at them, the encoder-decoder architecture may have strengths over the decoder-only one. For example, a decoder-only model predicts the next token by looking only at past tokens, while in an encoder-decoder model the encoder attends to the entire input context and passes its keys and values to the decoder.
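To make that masking difference concrete, here is a minimal NumPy sketch (not any specific model's implementation): the same scaled dot-product attention is run once with a causal mask, as in a decoder-only model, and once unmasked, as in the encoder of an encoder-decoder model. The function names and toy dimensions are my own illustration.

```python
import numpy as np

def attention(q, k, v, mask=None):
    """Scaled dot-product attention over one sequence."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if mask is not None:
        # Block disallowed positions with a large negative score.
        scores = np.where(mask, scores, -1e9)
    # Numerically stable softmax over the last axis.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))  # 4 tokens, embedding dim 8 (toy sizes)

# Decoder-only (autoregressive): causal mask, each token sees only itself
# and the past.
causal = np.tril(np.ones((4, 4), dtype=bool))
_, w_causal = attention(x, x, x, mask=causal)

# Encoder side of an encoder-decoder model: no mask, every token attends
# to the full input context in both directions.
_, w_full = attention(x, x, x)

print(np.triu(w_causal, k=1).sum())  # no attention mass on future tokens
print((w_full > 0).all())            # full bidirectional attention
```

The upper triangle of `w_causal` carries no attention mass, while every entry of `w_full` is positive: that bidirectional view of the input is what the encoder hands to the decoder as keys and values.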

Yes, decoder-only models are doing an amazing job, but I would argue that the encoder-decoder architecture has strengths over the decoder-only model that may become visible in some cases.


One more thought:

What you mention about LLMs being able to translate and do other tasks that seq2seq models do is seen mainly (and I would say 'only') in the very large LLMs, think GPT and Claude.

Smaller LLMs are not that good at these tasks; however, small seq2seq models are good at them.

So the size of the LLM is a variable to consider. Not everyone can afford their own GPT-scale model, but a seq2seq model for translation, for instance, is doable at a much lower cost.

When model size is small, the differences make more sense. Thanks for your clarification!