Questions about transformer architecture

Week 1
generative-ai-with-llms/lecture/R0xbD?t=220

  1. It was mentioned that models like GPT, Llama, etc. use a decoder-only architecture. (a) How does it work without the context provided by the encoder? (b) What is the context used for?

  2. We learnt that multi-head attention assigns (initially random) weights that are learned to capture token associations with some meaning/relevance. What is the difference between multi-head attention (in the encoder) and masked multi-head attention (in the decoder)?

For the first question:
Unlike encoder-decoder models, which use an encoder to process and understand the input context before generating output, decoder-only models build their context directly from the input sequence itself: the prompt plus any tokens already generated, processed through self-attention. That self-attended context is what conditions the prediction of each next token, so no separate encoder is needed. A rough sketch of this is below.
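
Here is a minimal sketch (my own, not from the course) of a greedy generation loop, assuming a hypothetical `model` that maps token ids of shape (1, seq_len) to logits of shape (1, seq_len, vocab_size). It shows how the "context" is just the sequence so far, which grows as each new token is appended:

```python
import torch

def generate(model, prompt_ids: torch.Tensor, max_new_tokens: int = 20) -> torch.Tensor:
    """Greedy autoregressive generation with a decoder-only model."""
    ids = prompt_ids
    for _ in range(max_new_tokens):
        logits = model(ids)                     # model attends over everything seen so far
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        ids = torch.cat([ids, next_id], dim=1)  # the new token becomes part of the context
    return ids
```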

For the second question:
The key difference is the use of masking in the decoder's self-attention: each position can only attend to itself and earlier positions, so future tokens are not visible, which preserves the autoregressive nature of text generation. The encoder's multi-head attention has no such mask, so every token can attend to every other token in the input.
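
To make the difference concrete, here is a small sketch of scaled dot-product attention (the function name and flag are mine, not from the course): with `causal=False` you get the encoder-style attention, and with `causal=True` an upper-triangular mask hides future positions, which is what masked multi-head attention does per head in the decoder.

```python
import math
import torch

def attention(q, k, v, causal: bool = False):
    """Scaled dot-product attention over (batch, seq_len, d) tensors."""
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    if causal:
        seq_len = q.size(-2)
        # True above the diagonal = positions in the future, which get masked out
        mask = torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()
        scores = scores.masked_fill(mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v
```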

But I would suggest doing the Natural Language Processing Specialization to understand this in more depth.
