What makes decoder-only models (like ChatGPT, BARD) so successful, when an encoder-decoder model is generally considered to have a deeper understanding of the meaning of the input?
I am not sure I would compare them like that. They are two different models with two different missions, although some tasks may overlap.
The decoder-only model ‘guesses’ the next token. That’s all it does.
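To make ‘guessing the next token’ concrete, here is a minimal sketch using GPT-2 (a small decoder-only model) through the Hugging Face transformers library. The model name and prompt are just examples chosen for illustration:

```python
# Minimal sketch: a decoder-only model scores every vocabulary token as a
# possible continuation, and "guessing" means picking the most probable one.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The capital of France is", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits            # shape: (1, seq_len, vocab_size)

next_token_id = logits[0, -1].argmax().item()  # most probable next token
print(tokenizer.decode(next_token_id))         # likely " Paris"
```

Text generation is just this step repeated: append the guessed token to the input and guess again.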
The encoder-decoder model converts one sequence into another sequence.
Someone might say: Yes! But a decoder-only model can also translate from English to French, and that is seq-to-seq! And I would answer: it is seq-to-seq for us humans, but for the decoder-only model it was just ‘guessing the most probable next word’, as in the sketch below.
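A rough sketch of what that means in practice, again with GPT-2 purely for illustration (it is a poor translator, but the mechanism is the same one the large models use): the prompt ends where the translation should begin, and the model just keeps guessing the most probable next token.

```python
# Sketch: "translation" as plain next-token prediction on a prompt.
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "English: I like coffee.\nFrench:"
inputs = tokenizer(prompt, return_tensors="pt")

# Greedy decoding: repeatedly pick the most probable next token.
output_ids = model.generate(
    **inputs,
    max_new_tokens=20,
    do_sample=False,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```

Nothing in the model ‘knows’ it is translating; the seq-to-seq framing exists only in how we read the prompt and its continuation.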
If I had to build a language translator, or handle any other task that goes from one sequence (audio, text, etc.) to another sequence (text, another language, etc.), I would do it with an encoder-decoder. That said, big decoder-only models like GPT can do a great job as well, but they have to be huge.
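For comparison, a minimal sketch of the encoder-decoder route, assuming a small pre-trained translation model such as Helsinki-NLP/opus-mt-en-fr from the Hugging Face hub:

```python
# Sketch: a dedicated encoder-decoder model maps the source sequence to a
# target sequence; the encoder reads the English sentence, the decoder
# writes the French one.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

name = "Helsinki-NLP/opus-mt-en-fr"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSeq2SeqLM.from_pretrained(name)

inputs = tokenizer("I like coffee.", return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
# Expected output: something like "J'aime le café."
```

A purpose-built encoder-decoder like this is tiny compared to GPT-scale decoder-only models, which is exactly the ‘they have to be huge’ point above.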