Anthropic develops robust defense against universal jailbreaks


Anthropic’s new Constitutional Classifiers system successfully defended against thousands of hours of human attempts to jailbreak its Claude models. The method reduced the jailbreak success rate from 86 percent to 4.4 percent in automated tests, with only minimal increases in refusal rates and compute costs. The system works by training input and output classifiers on synthetically generated data derived from a “constitution” of allowed and disallowed content, enabling it to detect and block potentially harmful prompts and completions. Anthropic is hosting a live demo and offering rewards of up to $20,000 for successful jailbreaks to further test and improve the system’s robustness. (Anthropic and arXiv)
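The two-classifier gating pattern described above can be sketched in a few lines. This is a toy illustration, not Anthropic’s implementation: the classifiers here are hypothetical keyword rules standing in for learned models trained on constitution-derived synthetic data, and `model` is a placeholder for a real LLM’s token stream.

```python
# Toy sketch of classifier-gated generation: an input classifier screens
# prompts before generation, and an output classifier screens the
# completion as it streams, halting mid-generation if it turns harmful.

# Stand-in for the constitution's disallowed-content categories.
BLOCKED_TOPICS = {"nerve agent", "bioweapon"}


def input_classifier(prompt: str) -> bool:
    """Return True if the prompt should be blocked (toy keyword rule)."""
    text = prompt.lower()
    return any(topic in text for topic in BLOCKED_TOPICS)


def output_classifier(partial_output: str) -> bool:
    """Return True if the (partial) completion should be blocked."""
    text = partial_output.lower()
    return any(topic in text for topic in BLOCKED_TOPICS)


def model(prompt: str):
    """Placeholder generator standing in for a real LLM's token stream."""
    for token in f"Here is a response to: {prompt}".split():
        yield token + " "


def guarded_generate(prompt: str) -> str:
    """Wrap generation with input and output classifiers."""
    if input_classifier(prompt):
        return "[input blocked]"
    output = ""
    for token in model(prompt):
        output += token
        # Streaming check: the output classifier runs on each partial
        # completion so harmful text can be cut off mid-generation.
        if output_classifier(output):
            return "[output blocked]"
    return output.strip()
```

Checking the output incrementally, rather than only once generation finishes, is what lets this style of guard stop a harmful completion partway through rather than after the fact.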