Generated Code Generates Overconfident Coders: Copilot AI tool may cause programmers to write buggy code


Tools that automatically write computer code may make their human users overconfident that the programs are bug-free.

What’s new: Stanford University researchers found that programmers who used OpenAI’s Codex, a model that generates computer code, were more likely to produce buggy software than those who coded from scratch.

How it works: The authors recruited 47 participants, from undergraduate students to professional programmers with decades of experience, to complete security-themed coding tasks. They gave 33 the option to use Codex, a fine-tuned version of GPT-3, through a custom user interface. The remaining 14 didn’t receive automated assistance. Both groups were allowed to copy code from the web.

  • The participants were given tasks including (1) write two Python functions that encrypt and decrypt a string respectively, (2) write a Python function that signs a message with a cryptographic key, (3) write a Python function that returns a File object for a given file path, and (4) write a JavaScript function that manipulates an SQL table.
  • The authors also watched screen recordings to observe the participants’ behavior — for instance, copying code generated by Codex — and note the origins of programming errors.
  • After completing the tasks, participants rated their confidence in the correctness and security of their answers. The Codex group also rated their trust in the model’s ability to generate secure code for each task.
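
To give a sense of what such a task involves, here is a minimal sketch of task (2), signing a message with a key. It is not the study's reference solution: the exact task specification isn't given here, so the sketch assumes a shared secret key and uses HMAC-SHA256 from Python's standard library (a message authentication code, not a public-key signature). The names sign_message and verify_message are hypothetical.

```python
import hashlib
import hmac
import secrets


def sign_message(key: bytes, message: bytes) -> bytes:
    """Return an HMAC-SHA256 tag for `message` under `key`.

    Assumption: a symmetric (shared-secret) scheme is acceptable.
    A public-key signature would need a third-party library instead.
    """
    return hmac.new(key, message, hashlib.sha256).digest()


def verify_message(key: bytes, message: bytes, tag: bytes) -> bool:
    """Check a tag using a constant-time comparison."""
    expected = sign_message(key, message)
    # compare_digest avoids leaking how many leading bytes matched.
    return hmac.compare_digest(expected, tag)


# Usage: generate a random 32-byte key, sign, then verify.
key = secrets.token_bytes(32)
tag = sign_message(key, b"hello")
is_valid = verify_message(key, b"hello", tag)
is_tampered_valid = verify_message(key, b"hellp", tag)
```

Even in a function this short, details such as drawing the key from a cryptographically secure source and comparing tags in constant time are easy to miss in code accepted without review, which is the kind of check the study's security ratings are concerned with.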

Results: The authors evaluated the responses manually according to whether they were functional and secure. Participants who used Codex generally produced code that was less functional and secure, yet they expressed greater confidence in it. That said, the results varied with the task and programming language.

  • Codex users whose code was nonfunctional were more likely to rate their answers as correct than members of the non-Codex group whose code was correct.
  • When coding in Python, participants in the non-Codex group were more than twice as likely to produce secure code.
  • Members of the Codex group who lacked prior digital-security experience were more likely to use unedited, generated code than those who had such experience (especially when coding in JavaScript, a less-familiar language for many participants).

Behind the news: Other research bolsters the notion that professional developers shouldn’t fear for their jobs quite yet. In a 2022 study, DeepMind’s AlphaCode model competed in 10 simulated contests. The model correctly solved 34 percent of the validation questions and outpaced 46 percent of humans who had taken up the same challenges.

Why it matters: Generative coding tools are often regarded as a way for programmers to save time and automate basic tasks. But that efficiency may come at a price. Coders who use such tools would do well to pay extra attention to debugging and security.

We’re thinking: Code generation is an exciting development despite the questions raised by this study. We welcome further studies that compare programmers who use Codex, those who copy code from the internet, and those who use no outside assistance. How long, on average, would it take subjects in each group to complete the tasks correctly and securely, taking into account the time required to debug generated code?