I am slightly confused about the distinction between encoder-only and decoder-only models.
If BERT is encoder-only, how does it produce summaries, and if GPT is decoder-only, how does it do classification tasks?
Also, an encoder-only architecture requires the input and output lengths to be the same. How can it do a classification task?
Thank you!
Hi @himsgpt,
Interesting questions.
In all cases you will see that to achieve these tasks you have to:
- Add additional layers to the standard model to accomplish the task.
- Make sure that the training is appropriate for the task at hand.
Let's go one by one:
How can an encoder-only model do summarization?
An encoder-only model is great at extracting the main features of an input. At the end of the encoder, we have a vector that has captured the “meaning” of the input. To turn this into a summary, you need to add more layers that process this vector into the summary text. Common choices are feed-forward layers followed by an auto-regressive model.
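To make this concrete, here is a minimal PyTorch sketch of that idea: a pretrained BERT encoder with a small auto-regressive decoder head added on top. The class name, layer sizes, and the two-layer decoder are illustrative assumptions, not a reference implementation:

```python
import torch
import torch.nn as nn
from transformers import BertModel


class BertSummarizer(nn.Module):
    """Illustrative sketch: a BERT encoder plus extra layers for summarization."""

    def __init__(self, vocab_size=30522, d_model=768):
        super().__init__()
        self.encoder = BertModel.from_pretrained("bert-base-uncased")
        # The "additional layers": a small auto-regressive decoder head.
        layer = nn.TransformerDecoderLayer(d_model=d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.token_embed = nn.Embedding(vocab_size, d_model)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, input_ids, attention_mask, summary_ids):
        # Encode the document; the encoder output captures its "meaning".
        memory = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state
        # Decode the summary auto-regressively, attending to the encoder output.
        causal_mask = nn.Transformer.generate_square_subsequent_mask(
            summary_ids.size(1)).to(summary_ids.device)
        out = self.decoder(self.token_embed(summary_ids), memory, tgt_mask=causal_mask)
        return self.lm_head(out)  # logits over the vocabulary at each position
```

The decoder head would then be trained on document/summary pairs, as per the second bullet above.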
How can a decoder-only model do classification?
Although decoder-only models are best suited to generating sequences, they can also be adapted to perform classification. To do this, you need to add a classification head to the end of the model, that is, some more layers that take the model's output, pass it through some feed-forward layers, and finish with a softmax layer. You also have to train the model for this task.
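A rough sketch of what that head could look like on top of GPT-2 (the head sizes and pooling choice are illustrative assumptions):

```python
import torch
import torch.nn as nn
from transformers import GPT2Model


class GPT2Classifier(nn.Module):
    """Illustrative sketch: a GPT-2 backbone with a classification head on top."""

    def __init__(self, num_labels=2):
        super().__init__()
        self.backbone = GPT2Model.from_pretrained("gpt2")
        hidden = self.backbone.config.n_embd  # 768 for the base model
        # The classification head: feed-forward layers ending in class logits.
        self.head = nn.Sequential(
            nn.Linear(hidden, hidden),
            nn.Tanh(),
            nn.Linear(hidden, num_labels),
        )

    def forward(self, input_ids, attention_mask):
        states = self.backbone(input_ids=input_ids,
                               attention_mask=attention_mask).last_hidden_state
        # Pool using the hidden state of the last real token (assumes right padding).
        last = attention_mask.sum(dim=1) - 1
        pooled = states[torch.arange(states.size(0)), last]
        return self.head(pooled)  # apply softmax (or CrossEntropyLoss) during training
```

This is the same general pattern Hugging Face's GPT2ForSequenceClassification follows: pool the last non-padding token, then apply a linear head.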
How can an encoder-only model, whose output length must match its input length, do classification?
Similar to the previous case, you can process the output of the encoder by passing it through a classification layer that will usually end with a softmax. Again, you need to train the model for the classification task.
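A minimal sketch, assuming the common BERT convention of pooling the [CLS] token (the class name and head are illustrative):

```python
import torch.nn as nn
from transformers import BertModel


class BertClassifier(nn.Module):
    """Illustrative sketch: a BERT encoder with a classification layer on top."""

    def __init__(self, num_labels=2):
        super().__init__()
        self.encoder = BertModel.from_pretrained("bert-base-uncased")
        self.classifier = nn.Linear(self.encoder.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask):
        states = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state
        # The encoder emits one vector per input token; taking only the [CLS]
        # vector gives a fixed-size representation, so the "output length must
        # match input length" constraint no longer matters for classification.
        return self.classifier(states[:, 0])  # logits; softmax gives probabilities
```

This is essentially what BertForSequenceClassification in the transformers library does for you.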
You may say: But GPT-3.5/GPT-4 can do it without all of these extra layers and steps! And you are right. You can get GPT-3.5/GPT-4 to do classification and so many other things, and this is because of the size of the model. These models are very powerful and handle many tasks very well, and that's what has us all so in awe! But for models of 7B or 13B parameters, which you can manage, train, and fine-tune on “normal” computers, you need to consider the steps above.
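With such a model, classification is just a prompt. A minimal sketch using the openai Python client (the model name, labels, and prompt wording are illustrative):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def classify_sentiment(text: str) -> str:
    """Zero-shot classification by prompting alone: no extra layers, no fine-tuning."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # illustrative model name
        messages=[
            {"role": "system",
             "content": "Classify the sentiment of the user's text as 'positive', "
                        "'negative', or 'neutral'. Reply with the label only."},
            {"role": "user", "content": text},
        ],
    )
    return response.choices[0].message.content.strip()


print(classify_sentiment("The battery life on this laptop is fantastic."))
```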
Final comment:
As you can see, the models can be adapted to perform tasks that are, let's say, not natural for them. You can get the models to execute these tasks, but you may not get the best results; the best results come from using the ideal model for each task. But this shows the strength of transformers!
Thoughts?
Thanks @Juan_Olano for the comprehensive answer.
That does make sense. It seems any model architecture can be tweaked to perform any set of tasks, but as you suggested, we should use each model for the tasks it was designed for.
One thought: while I see a lot of leaderboards for LLM benchmarks, none of them (to my knowledge) covers a comprehensive task list that could act as a playbook. It would be good to include comprehensive tasks, ranging from classification to summarization to text generation, for each model/architecture.