Some questions about LLM Training

Hi, I want to train an open-source model to automate code generation for a programming language. But I have some questions about LLM Training:

  1. How to choose an appropriate open-source model?
  2. How to collect data? How to normalize the collected data?
  3. How to use these data to train the open-source model?
  4. How to fine-tune the training to achieve the best results?
  5. How to apply the trained model in real scenarios?
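To make question 2 concrete, here is roughly what I have in mind for normalization: strip inconsistent whitespace, drop near-empty files, deduplicate by content hash, and emit JSON records for training. This is only a stdlib sketch; the file names, the `min_lines` threshold, and the record fields are placeholders I made up, not any established pipeline:

```python
import hashlib
import json

def normalize_source(text: str) -> str:
    """Normalize whitespace so trivially different copies hash the same."""
    lines = [line.rstrip() for line in text.expandtabs(4).splitlines()]
    return "\n".join(lines).strip() + "\n"

def build_records(files: dict, min_lines: int = 3) -> list:
    """Dedup by content hash and emit JSON-ready training records."""
    seen, records = set(), []
    for path, raw in files.items():
        code = normalize_source(raw)
        if code.count("\n") < min_lines:  # drop near-empty snippets
            continue
        digest = hashlib.sha256(code.encode()).hexdigest()
        if digest in seen:  # exact duplicate after normalization
            continue
        seen.add(digest)
        records.append({"path": path, "text": code})
    return records

# Illustrative corpus: b.src duplicates a.src up to whitespace; c.src is too short.
corpus = {
    "a.src": "def add(x, y):\n    return x + y\n\nprint(add(1, 2))\n",
    "b.src": "def add(x, y):\t\n\treturn x + y\n\nprint(add(1, 2))",
    "c.src": "x = 1\n",
}
records = build_records(corpus)
print(json.dumps(records, indent=2))  # only the deduplicated a.src survives
```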

Could you please help me answer these questions, or point me to some learning materials I can refer to?
Any input will be appreciated!

Meta spent about a gazillion dollars researching this for the coding tools in their Llama 2 project.

Doing this all by yourself is quite a lot to take on.

Maybe explore the state of the art a little by attending the Llama 2 short course.

Instead of a large model trained on massive amounts of data, I want a small, proprietary model (under ten billion parameters) trained only on the rules and paradigms of one programming language that already has a website with instructions and example code files for those paradigms. Through this project, I hope to learn how to help companies train their own specialized models with deep vertical expertise.
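Concretely, I imagine turning the site's paired instructions and code files into prompt/completion records along these lines. This is only a rough sketch of the shape of the data; the field names, the prompt wording, and the toy language snippets are my own assumptions:

```python
def make_pairs(pages: list) -> list:
    """Pair each documentation page's prose with its example code
    as a prompt/completion record for supervised fine-tuning."""
    pairs = []
    for page in pages:
        prose = page.get("prose", "").strip()
        code = page.get("code", "").strip()
        if not prose or not code:  # need both halves to form a pair
            continue
        pairs.append({
            "prompt": f"Following the language's documented paradigm:\n{prose}\n\nWrite the code:",
            "completion": code,
        })
    return pairs

# Illustrative scraped pages for a hypothetical language.
pages = [
    {"prose": "Define a greeting routine that prints Hello.",
     "code": 'routine greet { print "Hello" }'},
    {"prose": "Looping paradigm overview.", "code": ""},  # no example code, skipped
]
pairs = make_pairs(pages)
print(len(pairs))  # one usable pair
```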

Is there anything wrong with my idea? I hope to get your guidance.

Sorry, I don’t have any specific guidance in this area.