Defcon Contest Highlights AI Security: A hacker competition to break guardrails around language models


Hackers attacked AI models in a large-scale competition to discover vulnerabilities.

What’s new: At the annual Defcon hacker convention in Las Vegas, 2,200 people competed to break guardrails around language models, The New York Times reported. The contest, which was organized by AI safety nonprofits Humane Intelligence and SeedAI and sponsored by the White House and several tech companies, offered winners an Nvidia RTX A6000 graphics card.

Breaking models: Contestants in the Generative Red Team Challenge had 50 minutes to perform 21 tasks of varying difficulty, which they selected from a board like that of the game show Jeopardy. Seven judges scored their submissions.

  • Anthropic, Cohere, Google, Hugging Face, Meta, Nvidia, OpenAI, and Stability AI provided large language models for competitors to poke and prod.
  • Among the flaws discovered: inconsistencies in language translations, discrimination against a job candidate based on caste, and a reference to a nonexistent 28th amendment to the United States Constitution.
  • Two of the four winning scores were achieved by Stanford computer science Cody Ho, who entered the contest five times.
  • The organizers plan to release the contestants’ prompts and model outputs to researchers in September 2023 and a public dataset in August 2024.

Behind the news: Large AI developers often test their systems by hiring hackers called “red teams,” a term used by the United States military to represent enemy forces in Cold War-era war games, to attack them.

  • Google shed light on its red team in a July blog post. Members attempt to manipulate Google’s models into outputting data not intended by its developers, eliciting harmful or biased results, revealing training data, and the like.
  • Microsoft also recently featured its red team. The team, which started in 2018, probes models available on the company’s Azure cloud service.
  • OpenAI hired a red team of external researchers to evaluate the safety of GPT-4. They coaxed the model to produce chemical weapon recipes, made-up words in Farsi, and racial stereotypes before developers fine-tuned the model to avoid such behavior.

Why it matters: The security flaws found in generative AI systems are distinctly different from those in other types of software. Enlisting hackers to attack systems in development is essential in sniffing out flaws in conventional software. It’s a good bet for discovering deficiencies in AI models as well.

We’re thinking: Defcon attracts many of the world’s most talented hackers — people who have tricked ATMs into dispensing cash and taken over automobile control software. We feel safer knowing that this crowd is on our side.