Anthropic develops robust defense against universal jailbreaks


Anthropic’s new Constitutional Classifiers system successfully defended against thousands of hours of human attempts to jailbreak its Claude models. The method reduced the jailbreak success rate from 86 percent to 4.4 percent in automated tests, with only minimal increases in refusal rates and compute costs. The system works by training input and output classifiers on synthetically generated data derived from a “constitution” of allowed and disallowed content, enabling it to detect and block potentially harmful prompts and completions. Anthropic is hosting a live demo and offering rewards of up to $20,000 for successful jailbreaks to further test and improve the system’s robustness. (Anthropic and arXiv)
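The two-classifier gating pattern described above can be sketched in a few lines. This is a toy illustration, not Anthropic’s implementation: the classifiers here are hypothetical keyword rules standing in for learned models trained on constitution-derived synthetic data, and `model` is a placeholder for a real LLM’s token stream.

```python
# Toy sketch of classifier-gated generation: an input classifier screens
# prompts before generation, and an output classifier screens the
# completion as it streams, halting mid-generation if it turns harmful.

# Stand-in for the constitution's disallowed-content categories.
BLOCKED_TOPICS = {"nerve agent", "bioweapon"}


def input_classifier(prompt: str) -> bool:
    """Return True if the prompt should be blocked (toy keyword rule)."""
    text = prompt.lower()
    return any(topic in text for topic in BLOCKED_TOPICS)


def output_classifier(partial_output: str) -> bool:
    """Return True if the (partial) completion should be blocked."""
    text = partial_output.lower()
    return any(topic in text for topic in BLOCKED_TOPICS)


def model(prompt: str):
    """Placeholder generator standing in for a real LLM's token stream."""
    for token in f"Here is a response to: {prompt}".split():
        yield token + " "


def guarded_generate(prompt: str) -> str:
    """Wrap generation with input and output classifiers."""
    if input_classifier(prompt):
        return "[input blocked]"
    output = ""
    for token in model(prompt):
        output += token
        # Streaming check: the output classifier runs on each partial
        # completion so harmful text can be cut off mid-generation.
        if output_classifier(output):
            return "[output blocked]"
    return output.strip()
```

Checking the output incrementally, rather than only once generation finishes, is what lets this style of guard stop a harmful completion partway through rather than after the fact.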