Week 1
generative-ai-with-llms/lecture/R0xbD?t=220
-
It was mentioned that models like GPT, Llama, etc. use a decoder-only architecture. (a) How does it work without the context provided by the encoder? (b) What is the context used for?
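A rough sketch of my current understanding (my own illustration, not from the lecture): in a decoder-only model the prompt itself is the context. Every prediction is conditioned on the prompt plus all previously generated tokens, via masked self-attention over that growing sequence, so no separate encoder output is needed. The vocabulary, `next_token_logits`, and `W` below are made-up stand-ins for a real transformer stack.

```python
import numpy as np

vocab = ["<s>", "the", "cat", "sat", "on", "mat", "."]
rng = np.random.default_rng(0)
W = rng.normal(size=(len(vocab), len(vocab)))   # stand-in for learned parameters

def next_token_logits(token_ids):
    # A real decoder-only LM would run masked self-attention over all of token_ids;
    # this toy just averages one score vector per context token, but the key point
    # is the same: the prediction is conditioned on the whole sequence so far.
    return W[token_ids].mean(axis=0)

prompt = ["<s>", "the", "cat"]                  # the prompt IS the initial context
tokens = [vocab.index(t) for t in prompt]
for _ in range(4):
    logits = next_token_logits(tokens)          # condition on everything so far
    tokens.append(int(np.argmax(logits)))       # generated token joins the context
print(" ".join(vocab[i] for i in tokens))
```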
-
We learnt that multi-head attention assigns (initially random) weights to learn token associations with some meaning/relevance. What is the difference between the multi-head attention (in the encoder) and the masked multi-head attention (in the decoder)?
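A minimal single-head sketch, assuming standard scaled dot-product attention (the function names `attention` and `softmax` are my own, and multi-head attention would just run several such heads in parallel). The only mechanical difference it shows: the decoder's masked attention adds a causal mask so position i cannot attend to positions after i, while the encoder's attention looks at the whole sequence in both directions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V, causal=False):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)             # token-to-token similarity scores
    if causal:
        # Masked (decoder) attention: hide positions j > i so a token
        # cannot look at the future during training or generation.
        mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
        scores = np.where(mask, -np.inf, scores)
    weights = softmax(scores, axis=-1)          # attention weights over the sequence
    return weights @ V, weights

rng = np.random.default_rng(0)
seq_len, d_model = 4, 8
X = rng.normal(size=(seq_len, d_model))         # one head acting on 4 tokens
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))

_, enc_w = attention(X @ Wq, X @ Wk, X @ Wv, causal=False)  # encoder style
_, dec_w = attention(X @ Wq, X @ Wk, X @ Wv, causal=True)   # decoder style

print(np.round(enc_w, 2))  # every row attends to all 4 positions
print(np.round(dec_w, 2))  # upper triangle is zero: no attention to the future
```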