Constitutional Classifiers++: Production-Grade Defenses against Universal Jailbreaks

Hoagy Cunningham; Jerry Wei; Zihan Wang; Andrew Persic; Alwin Peng; Jordan Abderrachid; Raj Agarwal; Bobby Chen; Andy Dau; Alek Dimitriev; Logan Howard; Yijin Hua; Rob Gilson; Mu Lin; Christopher Liu; Vladimir Mikulik; Rohit Mittapalli; Clare O'Hara; Jin Pan; Nikhil Saxena; Alex Silverstein; Yue Song; Giulio Zhou; Jan Leike; Jared Kaplan; Ethan Perez; Mrinank Sharma

Published: 26 Jan 2026, Last Modified: 11 Feb 2026. Venue: ICLR 2026 Poster. License: CC BY 4.0

Keywords: robustness, safeguards

TL;DR: We introduce enhanced Constitutional Classifiers that deliver production-grade jailbreak robustness with dramatically reduced computational costs and refusal rates compared to previous-generation defenses.

Abstract: We introduce enhanced Constitutional Classifiers that deliver production-grade jailbreak robustness with dramatically reduced computational costs and refusal rates compared to previous-generation defenses. We first identify vulnerabilities in existing systems that evaluate model outputs without regard to the conversational context, and address these vulnerabilities using full exchange classifiers. Building on this, we implement a classifier cascade where lightweight classifiers screen all traffic, escalating only suspicious exchanges to more expensive classifiers. Combining this approach with other optimizations, we develop a new production-grade jailbreak defense system that achieves a 5.4× computational cost reduction compared to our baseline exchange classifier, while also achieving a 0.036% refusal rate on production traffic. Through extensive red-teaming comprising over 560K queries, we demonstrate protection against universal jailbreaks: no attack on this system successfully elicited responses to all eight target queries that were comparable in detail to those of an undefended model. Finally, we explore efficient classification techniques by training linear activation probes. We show that using logit smoothing and a weighted loss function is crucial for performance, and further that probes can be combined with external classifiers to provide particularly strong performance. Our work establishes Constitutional Classifiers as practical safeguards for large language models.
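
The cascade described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `Cascade` class, the two thresholds, and the keyword-based toy scorers are all hypothetical stand-ins, chosen only to show the control flow in which a cheap classifier scores every full exchange (prompt and response together) and only suspicious exchanges reach the expensive classifier.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical interface: a scorer maps a full exchange (prompt, response)
# to a harm score in [0, 1]. Real classifiers would be learned models.
Scorer = Callable[[str, str], float]


@dataclass
class Cascade:
    cheap: Scorer            # lightweight first-stage classifier, runs on all traffic
    expensive: Scorer        # costlier second-stage classifier, runs only on escalations
    escalate_at: float       # assumed first-stage threshold for escalation
    block_at: float          # assumed second-stage threshold for blocking
    expensive_calls: int = 0 # counts how often the expensive stage ran

    def should_block(self, prompt: str, response: str) -> bool:
        # Stage 1: screen every exchange with the cheap scorer.
        if self.cheap(prompt, response) < self.escalate_at:
            return False
        # Stage 2: only suspicious exchanges pay for the expensive scorer.
        self.expensive_calls += 1
        return self.expensive(prompt, response) >= self.block_at


def toy_cheap(prompt: str, response: str) -> float:
    # Toy stand-in: flags a broad keyword anywhere in the exchange.
    text = (prompt + " " + response).lower()
    return 0.9 if "synthesize" in text else 0.1


def toy_expensive(prompt: str, response: str) -> float:
    # Toy stand-in: a narrower check, applied only after escalation.
    text = (prompt + " " + response).lower()
    return 0.95 if "nerve agent" in text else 0.05
```

In this sketch, a benign exchange never touches the second stage, a borderline exchange is escalated and then cleared, and only an exchange that both stages flag is blocked. The saving comes from the expensive scorer running on a small fraction of traffic; the actual thresholds and classifier designs are not specified in the abstract.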

Supplementary Material: zip

Primary Area: foundation or frontier models, including LLMs

Submission Number: 14259
