Building a fine-tuned code review bot for an open-source Golang project

Skarlso · September 9, 2024, 5:46am

Hello everyone.

So I’ve been learning and reading everything about LLMs and trying to wrap my head around it all.

I’m fairly confident that I know how to train and fine-tune once I have a dataset. What I still don’t know, and what I feel the Coursera course also glossed over by using its own dataset, is dataset generation.

I have an idea for a project: fine-tuning bigcode/starcoder (on Hugging Face) for more specific code suggestions on a relatively large Golang codebase.

Then I would write a bot that uses it for code reviews, or basically as a coding buddy. I know Hugging Face had a post along the lines of “create your own copilot”, but that used a dataset generated from Jupyter notebooks.

So my question is… how do I generate a dataset from Golang code without handcrafting prompts? What format does the dataset need to be in? Can I write a program that generates it for me? I looked at some similar datasets on Hugging Face, but they all look like they have been handcrafted, which is insane. If that’s what it takes, I’ll do it; I’m just not sure whether there is anything better out there.

Thanks!

Skarlso · September 9, 2024, 6:34am

Would this be a valid dataset format for fine-tuning?

[
    {
        "text": "func init() {\n\tutilruntime.Must(clientgoscheme.AddToScheme(scheme))\n\tutilruntime.Must(corev1.AddToScheme(scheme))\n\tutilruntime.Must(deliveryv1alpha1.AddToScheme(scheme))\n\tutilruntime.Must(v1.AddToScheme(scheme))\n\t//+kubebuilder:scaffold:scheme\n}",
        "metadata": {
            "file_name": "main.go",
            "file_path": "./cmd/main.go",
            "line_number": 50,
            "func_name": "init"
        }
    },
    {

Basically the text would be the function and then some metadata.
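
Records of this shape can be produced by a program rather than by hand. The following is a minimal sketch, assuming the standard library packages go/parser and go/ast are used to find function declarations. The names Record, Metadata, and ExtractFuncs are hypothetical and not taken from any existing tool, and the sample input in main is made up for illustration. It only reproduces the record layout shown above; whether that layout suits a particular fine-tuning pipeline is a separate question.

```go
package main

import (
	"encoding/json"
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
	"log"
	"path/filepath"
)

// Metadata mirrors the "metadata" object in the sample record (hypothetical type).
type Metadata struct {
	FileName   string `json:"file_name"`
	FilePath   string `json:"file_path"`
	LineNumber int    `json:"line_number"`
	FuncName   string `json:"func_name"`
}

// Record is one dataset entry: the function source plus its metadata.
type Record struct {
	Text     string   `json:"text"`
	Metadata Metadata `json:"metadata"`
}

// ExtractFuncs parses one Go source file and returns one Record per
// top-level function or method declaration. Doc comments above a
// function are not included in Text.
func ExtractFuncs(path string, src []byte) ([]Record, error) {
	fset := token.NewFileSet()
	file, err := parser.ParseFile(fset, path, src, parser.ParseComments)
	if err != nil {
		return nil, err
	}
	var records []Record
	for _, decl := range file.Decls {
		fn, ok := decl.(*ast.FuncDecl)
		if !ok {
			continue // skip imports, types, vars, and consts
		}
		start := fset.Position(fn.Pos())
		end := fset.Position(fn.End())
		records = append(records, Record{
			// Slice the original bytes so formatting is preserved exactly.
			Text: string(src[start.Offset:end.Offset]),
			Metadata: Metadata{
				FileName:   filepath.Base(path),
				FilePath:   path,
				LineNumber: start.Line,
				FuncName:   fn.Name.Name,
			},
		})
	}
	return records, nil
}

func main() {
	// Hypothetical sample input; a real run would walk the repository
	// (for example with filepath.WalkDir) and read each .go file.
	src := []byte("package main\n\nfunc init() {\n\tprintln(\"hello\")\n}\n")
	records, err := ExtractFuncs("./cmd/main.go", src)
	if err != nil {
		log.Fatal(err)
	}
	out, err := json.MarshalIndent(records, "", "    ")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(string(out))
}
```

Slicing the source bytes by position, instead of re-printing the AST, keeps each function exactly as it is written in the repository, including inline comments such as the kubebuilder marker in the sample above.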
