Keywords: robustness, safeguards
TL;DR: We introduce enhanced Constitutional Classifiers that deliver production-grade jailbreak robustness with dramatically reduced computational costs and refusal rates compared to previous-generation defenses.
Abstract: We introduce enhanced Constitutional Classifiers that deliver production-grade jailbreak robustness with dramatically reduced computational costs and refusal rates compared to previous-generation defenses. We first identify vulnerabilities in existing systems that evaluate model outputs without regard to the conversational context, and address these vulnerabilities using full exchange classifiers. Building on this, we implement a classifier cascade in which lightweight classifiers screen all traffic and escalate only suspicious exchanges to more expensive classifiers. Combining this approach with other optimizations, we develop a new production-grade jailbreak defense system that achieves a 5.4× computational cost reduction compared to our baseline exchange classifier, while also achieving a 0.036% refusal rate on production traffic. Through extensive red-teaming comprising over 560K queries, we demonstrate protection against universal jailbreaks: no attack on this system successfully elicited responses to all eight target queries comparable in detail to an undefended model. Finally, we explore efficient classification techniques by training linear activation probes. We show that using logit smoothing and a weighted loss function is crucial for performance, and that probes can be combined with external classifiers to provide particularly strong performance. Our work establishes Constitutional Classifiers as practical safeguards for large language models.
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 14259
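Illustrative sketch: the classifier cascade described in the abstract can be pictured as a two-stage check in which a cheap scorer sees every exchange and only suspicious ones reach the expensive scorer. The code below is a minimal sketch of that idea, not the authors' implementation. The function name `cascade_blocks`, the `Scorer` type, the thresholds `escalate_at` and `block_at`, and the toy keyword scorers are all hypothetical and chosen purely for illustration.

```python
from typing import Callable

# Hypothetical scorer type: maps a full exchange (prompt and response together)
# to a harm score in [0, 1]. Scoring the whole exchange, rather than the output
# alone, mirrors the "full exchange classifier" idea from the abstract.
Scorer = Callable[[str, str], float]


def cascade_blocks(
    prompt: str,
    response: str,
    cheap: Scorer,
    expensive: Scorer,
    escalate_at: float = 0.2,  # assumed threshold, not from the paper
    block_at: float = 0.5,     # assumed threshold, not from the paper
) -> bool:
    """Return True if the exchange should be blocked."""
    # Stage 1: the lightweight scorer screens every exchange.
    if cheap(prompt, response) < escalate_at:
        return False  # benign-looking traffic never reaches stage 2
    # Stage 2: only escalated exchanges pay for the expensive scorer.
    return expensive(prompt, response) >= block_at


# Toy stand-ins for real classifiers (assumptions for demonstration only).
def toy_cheap(prompt: str, response: str) -> float:
    return 0.9 if "flagged-term" in (prompt + response).lower() else 0.0


def toy_expensive(prompt: str, response: str) -> float:
    return 0.8 if "detailed-harmful-steps" in response.lower() else 0.1


if __name__ == "__main__":
    print(cascade_blocks("hello", "hi there", toy_cheap, toy_expensive))
    print(cascade_blocks("flagged-term?", "detailed-harmful-steps", toy_cheap, toy_expensive))
```

In this sketch the cost saving comes from the early return: the expensive scorer runs only when the cheap score crosses `escalate_at`, so the lower that escalation rate is on real traffic, the larger the saving.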