How to optimize a transformer model (at a code level)

I have coded a basic transformer following Course 4 of the NLP Specialization. Now I want to know how the basic structure of a transformer can be improved. It would be great if someone could give me some pointers on how to go about it. For example, what principles do models like LLAMA and GEMMA use to improve on the basic transformer code? Thanks in advance.

Hi @swarnava2014 ,

It is encouraging to know that you would like to understand what goes into building SOTA LLMs. These models are not standalone transformers; it is better to view them as a system/product rather than just a transformer model.

You can read articles, review short courses (we have plenty at DLAI), watch educational videos explaining GPT, and throw some research papers into the mix. There are different sides of these models that you can venture into, some of them being:

  1. GPT architecture (start here; see the sketch after this list for one concrete tweak)
  2. Scaling Laws
  3. LLM pre/post training algorithmic optimizations
  4. AI Alignment/AI Safety

and many more… Of course, there are other considerations such as large-scale data handling, GPU/TPU programming, etc. Check this thread for some inspiration.
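
To make point 1 a bit more concrete, here is a minimal sketch (PyTorch assumed; the module and variable names are just illustrative) of RMSNorm, one of the small code-level changes LLaMA-family models make to the original transformer block: LayerNorm is replaced by a cheaper root-mean-square normalization with no mean-centering and no bias, applied pre-attention and pre-MLP.

```python
# Minimal RMSNorm sketch (PyTorch assumed) — a LLaMA-style replacement for LayerNorm.
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))  # learnable gain, no bias term

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Scale by the inverse root-mean-square over the feature dimension.
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x * rms)

# Usage: drop it in wherever your transformer block currently calls LayerNorm,
# applying it before attention / MLP (pre-norm), as LLaMA-style models do.
norm = RMSNorm(dim=512)
x = torch.randn(2, 16, 512)   # (batch, seq_len, d_model)
print(norm(x).shape)          # torch.Size([2, 16, 512])
```

Other tweaks in the same spirit that you will see in LLaMA/Gemma-style code are rotary positional embeddings (RoPE) instead of learned or sinusoidal positions, gated feed-forward layers (SwiGLU/GeGLU), and grouped-query attention to shrink the KV cache; each is a fairly self-contained change to the block you already have.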