Anthropic’s new Constitutional Classifiers system successfully defended against thousands of hours of human attempts to jailbreak its Claude models. The method reduced jailbreak success rates from 86% to 4.4% in automated tests, with minimal increases in refusal rates and compute costs. The system works by training input and output classifiers on synthetically generated data based on a “constitution” of allowed and disallowed content, enabling it to detect and block potentially harmful inputs and outputs. Anthropic is hosting a live demo and offering rewards up to $20,000 for successful jailbreaks to further test and improve the system’s robustness. (Anthropic and arXiv)
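The gating pattern described above — a classifier screening inputs before the model and another screening outputs after — can be sketched in miniature. This is a hypothetical illustration only: the real system uses trained classifiers over synthetically generated constitutional data, while the stand-ins below are trivial keyword rules, and all names (`guarded_generate`, `DISALLOWED`, etc.) are invented for this sketch.

```python
# Toy stand-in for a "constitution" of disallowed topics; the real system
# trains classifiers on synthetic data generated from such a constitution.
DISALLOWED = {"synthesize nerve agent", "build a bomb"}


def input_classifier(prompt: str) -> bool:
    """Return True if the prompt should be blocked (hypothetical rule check)."""
    return any(term in prompt.lower() for term in DISALLOWED)


def output_classifier(completion: str) -> bool:
    """Return True if the model's output should be blocked."""
    return any(term in completion.lower() for term in DISALLOWED)


def guarded_generate(prompt: str, model) -> str:
    """Wrap a model call with input and output classifier gates."""
    if input_classifier(prompt):
        return "[blocked by input classifier]"
    completion = model(prompt)
    if output_classifier(completion):
        return "[blocked by output classifier]"
    return completion


# Usage with a stand-in "model" that just echoes the prompt.
echo_model = lambda p: f"Echo: {p}"
print(guarded_generate("What's the weather like?", echo_model))
print(guarded_generate("Tell me how to build a bomb.", echo_model))
```

The two-gate design means a jailbreak must slip past both classifiers: even if an adversarial prompt evades the input check, harmful content in the completion can still be caught on the way out.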