I have gone through the course “How Transformer LLMs Work” by Jay Alammar and Maarten Grootendorst.
Could you please answer the questions below? Under each one I have added a small sketch of my current understanding; please correct me wherever it is wrong.
1. What exactly is the supervised training data, and what is the label? Is the next token in the sequence the label?
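My tentative understanding for question 1, as a minimal sketch (the sentence and token ids below are invented for illustration): the model is trained on raw text, and the label at each position is simply the token that comes next.

```python
# Invented token ids for "the cat sat on the mat" (the ids are made up).
token_ids = [12, 981, 4077, 55, 12, 2299]

# Inputs and labels are the same sequence, shifted by one position:
# each position's label is the next token in the text.
inputs = token_ids[:-1]   # [12, 981, 4077, 55, 12]
labels = token_ids[1:]    # [981, 4077, 55, 12, 2299]

for context_end, label in zip(range(1, len(token_ids)), labels):
    print(f"context = {token_ids[:context_end]} -> label = {label}")
```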
2. Where are the self-attention scores actually used, and how?
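For question 2, what I took from the course is that the attention scores, after a softmax, become the weights of a weighted average over the value vectors. A minimal NumPy sketch of scaled dot-product attention (single head, random toy data, no causal mask):

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_k = 4, 8                      # 4 tokens, 8-dim head (toy sizes)

Q = rng.standard_normal((seq_len, d_k))  # queries
K = rng.standard_normal((seq_len, d_k))  # keys
V = rng.standard_normal((seq_len, d_k))  # values

# Raw attention scores: similarity of each query with every key.
scores = Q @ K.T / np.sqrt(d_k)          # (seq_len, seq_len)

# Softmax turns each row of scores into weights that sum to 1.
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

# The scores' only job: mix the value vectors into each token's new representation.
output = weights @ V                     # (seq_len, d_k)
print(output.shape)                      # (4, 8)
```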
3. How are the initial token embeddings generated?
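For question 3, my understanding is that the initial embeddings are just rows looked up from a trainable embedding matrix, as in this PyTorch sketch (the vocabulary size, model width, and token ids are invented):

```python
import torch
import torch.nn as nn

vocab_size, d_model = 50_000, 512              # invented sizes
embedding = nn.Embedding(vocab_size, d_model)  # trainable lookup table

token_ids = torch.tensor([12, 981, 4077])      # made-up token ids
vectors = embedding(token_ids)                 # (3, 512): one row per token
print(vectors.shape)
```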
4. How are positional encodings generated and added to the word embeddings?
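For question 4, the original Transformer paper uses fixed sinusoidal encodings that are added element-wise to the embeddings; here is my sketch of that recipe (note that many modern LLMs instead learn positional embeddings or use rotary embeddings):

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)); PE[pos, 2i+1] = cos(same)."""
    positions = np.arange(seq_len)[:, None]          # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]         # (1, d_model/2)
    angles = positions / (10_000 ** (dims / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                     # even feature indices
    pe[:, 1::2] = np.cos(angles)                     # odd feature indices
    return pe

seq_len, d_model = 6, 512                            # toy sizes
embeddings = np.random.default_rng(0).standard_normal((seq_len, d_model))
inputs = embeddings + sinusoidal_positional_encoding(seq_len, d_model)
```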
5. What is layer normalization?
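For question 5, my reading is that layer normalization standardizes each token's feature vector to zero mean and unit variance, then applies a learned per-feature scale and shift. A NumPy sketch of the computation:

```python
import numpy as np

def layer_norm(x: np.ndarray, gamma: np.ndarray, beta: np.ndarray,
               eps: float = 1e-5) -> np.ndarray:
    # Normalize across the feature dimension of each token independently.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta            # learned scale and shift

d_model = 8
x = np.random.default_rng(0).standard_normal((3, d_model))  # 3 tokens
out = layer_norm(x, gamma=np.ones(d_model), beta=np.zeros(d_model))
print(out.mean(axis=-1), out.std(axis=-1))  # ~0 and ~1 per token
```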
I would appreciate a prompt reply to my email address, skgadalay@gmail.com.
Thanks in advance.