Building a fine-tuned code review bot for an open-source Golang project

Hello everyone.

So I’ve been learning and reading everything about LLMs and trying to wrap my head around it all.

I’m fairly confident I know how to train and fine-tune a model once I have a dataset. What I still don’t know is how to generate the dataset, and I feel like the Coursera course also glossed over this by using its own dataset.

My project idea is to fine-tune bigcode/starcoder (on Hugging Face) to give more specific code suggestions for a relatively large Golang codebase.

Then I would write a bot that uses it for code reviews, or basically as a coding buddy. I know Hugging Face had a post along the lines of “create your own copilot”, but that used a dataset generated from Jupyter notebooks.

So my question is… how do I generate a dataset from Golang code without handcrafting prompts? What format does the dataset need to be in? Can I write a program that generates it for me? I looked at some similar datasets on Hugging Face, but they all look like they have been handcrafted, which is insane. 😃 If that’s what it takes I’ll do it, but I’m not sure whether there is anything better out there.

Thanks!

Would this be a valid dataset format for fine-tuning?

[
    {
        "text": "func init() {\n\tutilruntime.Must(clientgoscheme.AddToScheme(scheme))\n\tutilruntime.Must(corev1.AddToScheme(scheme))\n\tutilruntime.Must(deliveryv1alpha1.AddToScheme(scheme))\n\tutilruntime.Must(v1.AddToScheme(scheme))\n\t//+kubebuilder:scaffold:scheme\n}",
        "metadata": {
            "file_name": "main.go",
            "file_path": "./cmd/main.go",
            "line_number": 50,
            "func_name": "init"
        }
    },
    {

Basically, the "text" field would hold the source of one function, and "metadata" would record where that function came from.
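For what it's worth, records in this shape don't have to be written by hand. Below is a minimal sketch of a generator, assuming Go's standard `go/parser` and `go/ast` packages are used to find each top-level function and the function text is sliced straight out of the source bytes so comments inside the body are preserved. The `Record`, `Metadata`, and `extractFuncs` names are hypothetical and only chosen to mirror the JSON above. Whether this format suits a particular fine-tuning script is a separate question that this sketch does not answer.

```go
package main

import (
	"encoding/json"
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
	"os"
	"path/filepath"
)

// Metadata mirrors the "metadata" object in the proposed format (hypothetical type).
type Metadata struct {
	FileName   string `json:"file_name"`
	FilePath   string `json:"file_path"`
	LineNumber int    `json:"line_number"`
	FuncName   string `json:"func_name"`
}

// Record is one dataset entry: the function source plus its metadata (hypothetical type).
type Record struct {
	Text     string   `json:"text"`
	Metadata Metadata `json:"metadata"`
}

// extractFuncs parses one Go source file and returns a Record for every
// top-level function or method that has a body.
func extractFuncs(path string, src []byte) ([]Record, error) {
	fset := token.NewFileSet()
	file, err := parser.ParseFile(fset, path, src, parser.ParseComments)
	if err != nil {
		return nil, err
	}

	var records []Record
	for _, decl := range file.Decls {
		fn, ok := decl.(*ast.FuncDecl)
		if !ok || fn.Body == nil {
			continue // skip non-functions and body-less declarations
		}
		start := fset.Position(fn.Pos()) // position of the "func" keyword
		end := fset.Position(fn.End())   // position just past the closing brace
		records = append(records, Record{
			// Slice the original bytes so formatting and inner comments survive.
			Text: string(src[start.Offset:end.Offset]),
			Metadata: Metadata{
				FileName:   filepath.Base(path),
				FilePath:   path,
				LineNumber: start.Line,
				FuncName:   fn.Name.Name,
			},
		})
	}
	return records, nil
}

func main() {
	// Inline sample source so the sketch runs without any files on disk.
	src := []byte("package main\n\nimport \"fmt\"\n\nfunc hello() {\n\tfmt.Println(\"hi\")\n}\n")

	records, err := extractFuncs("./cmd/main.go", src)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}

	enc := json.NewEncoder(os.Stdout)
	enc.SetIndent("", "    ")
	if err := enc.Encode(records); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```

To cover a whole repository, the same function could be called from a `filepath.WalkDir` callback on every `.go` file, reading each file with `os.ReadFile` and appending the results to one slice before encoding.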