How to optimize a transformer model (at a code level)

I have coded a basic transformer following Course 4 of the NLP Specialization. Now I want to know how the basic structure of a transformer can be improved. It would be great if someone could give me some pointers on how to go about it. For example, what principles do models like LLAMA and GEMMA use to improve on the basic transformer code? Thanks in advance.

Hi @swarnava2014 ,

It is encouraging to know that you would like to understand what goes into building SOTA LLMs. These models are not standalone transformers; it is better to view them as a system/product rather than just a transformer model.

You can read articles, review short courses (we have plenty at DLAI), watch educational videos explaining GPT, and throw some research papers into the mix. There are different sides of these models that you can venture into, some of them being:

  1. GPT architecture (start here; see the sketch after this list for one concrete tweak)
  2. Scaling Laws
  3. LLM pre/post training algorithmic optimizations
  4. AI Alignment/AI Safety

and many more… Of course, there are other considerations such as large-scale data handling, GPU/TPU programming, etc. Check this thread for some inspiration.
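
To make point 1 a bit more concrete, here is a minimal sketch (PyTorch assumed; the module and variable names are just illustrative) of RMSNorm, one of the small code-level changes LLaMA-family models make to the original transformer block: LayerNorm is replaced by a cheaper root-mean-square normalization with no mean-centering and no bias, applied pre-attention and pre-MLP.

```python
# Minimal RMSNorm sketch (PyTorch assumed) — a LLaMA-style replacement for LayerNorm.
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))  # learnable gain, no bias term

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Scale by the inverse root-mean-square over the feature dimension.
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x * rms)

# Usage: drop it in wherever your transformer block currently calls LayerNorm,
# applying it before attention / MLP (pre-norm), as LLaMA-style models do.
norm = RMSNorm(dim=512)
x = torch.randn(2, 16, 512)   # (batch, seq_len, d_model)
print(norm(x).shape)          # torch.Size([2, 16, 512])
```

Other tweaks in the same spirit that you will see in LLaMA/Gemma-style code are rotary positional embeddings (RoPE) instead of learned or sinusoidal positions, gated feed-forward layers (SwiGLU/GeGLU), and grouped-query attention to shrink the KV cache; each is a fairly self-contained change to the block you already have.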