I’m just learning about techniques such as ROUGE and BLEU for evaluating trained models.
I’m interested in training a model to generate computer code. Is there a metric commonly used to evaluate the performance of such models?
I can see this being a challenge, since the generated code must not only implement the correct algorithm but also be syntactically valid in the chosen programming language.
I'd appreciate any guidance, given my limited understanding of this area.
Hi @batton
For evaluating code-generating models, I think you are on the right track! Evaluating generated code does present unique challenges, and traditional metrics like BLEU and ROUGE can provide some insight, but they fall short for code-specific evaluation.
As I'm sure you are aware, BLEU and ROUGE measure surface similarity between generated text and reference text, which works well for natural language but can miss the mark for code. Two snippets with different syntax can be logically equivalent, so a model's code could have low BLEU/ROUGE scores despite being functionally correct.
Commonly Used Metrics for Code Generation
Exact Match (EM): Measures if the generated code exactly matches a reference solution. This can be useful but limited since multiple syntactically different code solutions can still be correct.
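For illustration, a minimal exact-match check (a sketch, assuming Python and whitespace-only normalization) might look like this; note that it still rejects functionally equivalent code that differs in naming or structure:

```python
def _norm(s: str) -> str:
    # Strip outer blank lines and trailing whitespace so trivial formatting
    # differences don't break the match.
    return "\n".join(line.rstrip() for line in s.strip().splitlines())

def exact_match(generated: str, reference: str) -> bool:
    """Return True if generated code matches the reference after normalization."""
    return _norm(generated) == _norm(reference)

print(exact_match("x = 1 + 2", "x = 1 + 2"))  # True
print(exact_match("y = 1 + 2", "x = 1 + 2"))  # False, despite equivalent behaviour
```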
Compilation Success: You touched on this when you mentioned syntactic correctness: checking whether the generated code compiles (or parses) is a useful metric since it guarantees syntactic validity. However, it only verifies that the code is “runnable” and doesn’t confirm that it’s logically correct.
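If the target language is Python, a minimal syntax check can lean on the standard library’s `ast` module (for compiled languages you’d invoke the compiler instead, e.g. `gcc -fsyntax-only`):

```python
import ast

def syntax_ok(code: str) -> bool:
    """Check that the generated Python code at least parses.
    This says nothing about whether it does the right thing."""
    try:
        ast.parse(code)
        return True
    except SyntaxError:
        return False

print(syntax_ok("def add(a, b): return a + b"))   # True
print(syntax_ok("def add(a, b) return a + b"))    # False (missing colon)
```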
Functional Correctness: This is crucial. Running the generated code against a set of test cases to verify the output aligns with expected results is a highly reliable measure. A common approach is to set up a suite of tests covering different scenarios and edge cases, then calculate the pass rate as a metric of accuracy.
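Here is a minimal sketch of a pass-rate computation, assuming the generated code defines a function with a known name (`add` in the example) and that test cases are (arguments, expected output) pairs. In a real harness you would sandbox execution rather than call `exec` on untrusted model output:

```python
def pass_rate(code: str, func_name: str, test_cases: list) -> float:
    """Execute generated code and return the fraction of test cases it passes.
    NOTE: exec() on untrusted model output is unsafe; real harnesses sandbox this."""
    namespace: dict = {}
    try:
        exec(code, namespace)           # define the generated function
    except Exception:
        return 0.0                      # code that doesn't even run scores zero
    func = namespace.get(func_name)
    if func is None:
        return 0.0
    passed = 0
    for args, expected in test_cases:
        try:
            if func(*args) == expected:
                passed += 1
        except Exception:
            pass                        # runtime errors count as failures
    return passed / len(test_cases)

generated = "def add(a, b):\n    return a + b"
tests = [((1, 2), 3), ((0, 0), 0), ((-1, 1), 0)]
print(pass_rate(generated, "add", tests))   # 1.0
```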
Code Quality Metrics (Optional): You can also evaluate readability and maintainability with metrics like cyclomatic complexity or lines of code (LOC), which measure structural aspects of the code. These might be less relevant for model training but can still inform you about the code’s complexity and clarity.
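As a rough sketch (dedicated tools such as radon compute these more carefully), you can approximate cyclomatic complexity by counting decision points in the AST and pair that with a simple line count:

```python
import ast

def quality_metrics(code: str) -> dict:
    """Rough structural metrics: non-blank lines of code and an approximate
    cyclomatic complexity (1 + number of decision points in the AST)."""
    tree = ast.parse(code)
    decision_nodes = (ast.If, ast.For, ast.While, ast.Try,
                      ast.BoolOp, ast.IfExp, ast.comprehension)
    decisions = sum(isinstance(node, decision_nodes) for node in ast.walk(tree))
    loc = len([line for line in code.splitlines() if line.strip()])
    return {"loc": loc, "approx_cyclomatic_complexity": 1 + decisions}

sample = "def sign(x):\n    if x > 0:\n        return 1\n    elif x < 0:\n        return -1\n    return 0"
print(quality_metrics(sample))
```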
LLM as a Judge or Knowledge Distillation (Optional but cool): Finally, you could use a more capable LLM (e.g. OpenAI’s o1) to review the generated code and decide whether it is good enough, also taking into account coding guidelines, dynamic execution characteristics (e.g. memory consumption, execution time), or anything else you consider relevant. Using a model to score outputs this way is usually called LLM-as-a-judge; if you then feed those judgments back into training your own model, it shades into knowledge distillation.
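As a rough illustration, assuming the `openai` Python package, an `OPENAI_API_KEY` in the environment, and a placeholder judge model name (swap in whichever model you actually use), an LLM-as-a-judge check could look like:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def judge_code(task: str, code: str, guidelines: str = "PEP 8, clear naming") -> str:
    """Ask a stronger LLM to rate the generated code on a 1-5 scale.
    The prompt, scale, and model choice are all things you'd want to tune."""
    prompt = (
        f"Task: {task}\n\nGenerated code:\n```python\n{code}\n```\n\n"
        f"Guidelines: {guidelines}\n"
        "Rate the code from 1 (unusable) to 5 (correct, clean, efficient) "
        "and briefly explain your rating."
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder judge model; not prescriptive
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# print(judge_code("Sum two integers", "def add(a, b):\n    return a + b"))
```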
Suggested Approach
For code-generation models, a two-stage evaluation is often most effective:
Syntax Check: Ensure the code compiles without errors.
Functional Test: Run the generated code against multiple test cases to verify that it produces the correct results.
Using a test framework to execute these functional checks is a great idea and provides a reliable evaluation method that focuses on the ultimate goal: producing correct, working code.
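For example, with pytest (assuming you write each model output to a module, here a hypothetical `generated_solution.py` exposing an `add` function), the functional stage reduces to an ordinary parametrized test, and the pass rate falls out of pytest’s summary:

```python
# test_generated.py -- run with `pytest test_generated.py`
import pytest
from generated_solution import add   # assumed module/function name for the model's output

@pytest.mark.parametrize("a, b, expected", [
    (1, 2, 3),
    (0, 0, 0),
    (-5, 5, 0),
])
def test_add(a, b, expected):
    assert add(a, b) == expected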
By combining these metrics, you’ll gain a more holistic view of how well the model performs in generating syntactically correct, functional, and quality code.
Let me know how you get on; I would be very interested to learn more!