Why use Encoder-Decoder Models?

I’m watching the week 1 video on “pre-training large language models”.

She states that encoder-decoder models are good for translation or summarization. But can’t those tasks be done with a decoder-only model? Why use an encoder-decoder model then?

If you want to understand that level of detail, it might be a better idea to take DLS Course 5 or NLP Courses 3 and 4. That’s where you learn the technical underpinnings of LLMs. In the short courses they are just showing you how to apply them or build apps on top of them.

The TL;DR version is that the encoder “encodes” the input into an “embedding space” and the decoder then maps that result “outward” into the end result you want. So think of the encoding step as figuring out what the input says, distilling it into the meaning that is relevant to the output you eventually want. The decoding phase then takes that “distilled meaning” and generates your desired output from it, e.g. the translation or the summary.
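Here’s a minimal sketch of that flow, not from the course: it uses the Hugging Face transformers library with t5-small as a stand-in encoder-decoder model, both of which are my own choices for illustration.

```python
# Minimal encoder-decoder sketch (illustrative only; T5 via Hugging Face
# `transformers` is my choice of example, not something from the course).
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

# Encoding step: the encoder reads the whole input and distills it into a
# sequence of hidden states (the "distilled meaning" described above).
inputs = tokenizer("translate English to German: Hello, how are you?",
                   return_tensors="pt")

# Decoding step: the decoder generates the target text token by token,
# attending to those encoder states via cross-attention at every step.
output_ids = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```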

I understand that, but can’t decoder-only models accomplish the same tasks? GPT-4, for example, can do translation, question answering, and summarization. It just does it in a different way, by predicting the next token. E.g. given “how do you say hello in Spanish?”, the predicted next token in a well-trained decoder-only model would be “hola”.
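Something like this sketch is what I have in mind (using the Hugging Face transformers library with GPT-2 as a small public stand-in, since GPT-4’s weights aren’t available; a base model this size won’t answer reliably, but the mechanism is the same):

```python
# Decoder-only sketch: no separate encoder, just next-token prediction.
# GPT-2 stands in for GPT-4 here purely for illustration.
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# The prompt and the answer live in one token stream; the model simply
# keeps extending that stream one predicted token at a time.
prompt = "Q: How do you say hello in Spanish?\nA:"
inputs = tokenizer(prompt, return_tensors="pt")

output_ids = model.generate(**inputs, max_new_tokens=10,
                            pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```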

Evidently the encoder/decoder strategy works better in some respects, otherwise people wouldn’t use it. Perhaps it’s easier to train, or perhaps it gives better results in general. One can also imagine that the dual architecture is more flexible: for example, the same encoded representation of an input could be decoded into several different target languages.

So maybe the real answer here is that I’m not the right person to be answering this question. Sorry, I should not have waded in on this thread, since I am not a mentor for this course.

Also note that GPT-4 is a recent LLM, so it is based on Transformers and attention. I have not studied the internals of any of the published GPT models specifically, but I have taken DLS Course 5, which covers Sequence and Attention Models. The original Transformer was an encoder/decoder design, and my understanding is that the GPT family keeps essentially the decoder half of that design and scales it up. So if you use GPT-4 through a chat interface, it does behave like a straight decoder-style model; the point is that what’s happening “under the covers” in both kinds of models is the same attention machinery that the encoder/decoder Transformer introduced.
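If you want to check that architectural difference for yourself, the model configs in the Hugging Face transformers library expose it directly; GPT-2 and T5 below are just small public stand-ins I picked for the two families:

```python
# Quick check of decoder-only vs. encoder-decoder (illustrative model choices).
from transformers import AutoConfig

gpt2_cfg = AutoConfig.from_pretrained("gpt2")    # GPT-style model
t5_cfg = AutoConfig.from_pretrained("t5-small")  # original Transformer style

print(gpt2_cfg.is_encoder_decoder)  # False: a single causal decoder stack
print(t5_cfg.is_encoder_decoder)    # True: separate encoder and decoder stacks
```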
