Keywords: LLM Safety, Out-of-Distribution Detection, Jailbreaking, Representation Learning, Selective Generation, Anomaly Detection
TL;DR: Our paper presents T3, an efficient out-of-distribution-based safety method that models the features of "safe" prompts to achieve state-of-the-art performance in detecting jailbreaks and toxic content while mitigating over-refusal.
Abstract: Current approaches to LLM safety fundamentally rely on a brittle cat-and-mouse game of identifying and blocking known threats via guardrails. We argue for a fresh approach: robust safety comes not from enumerating what is harmful, but from \emph{deeply understanding what is safe}. We introduce \textbf{T}rust \textbf{T}he \textbf{T}ypical \textbf{(T3)}, a framework that operationalizes this principle by treating safety as an out-of-distribution (OOD) detection problem. T3 learns the distribution of acceptable prompts in a semantic space and flags any significant deviation as a potential threat. Unlike prior methods, it requires no training on harmful examples, yet achieves state-of-the-art performance across 18 benchmarks spanning toxicity, hate speech, jailbreaking, multilingual harms, and over-refusal, reducing false positive rates by up to 40x relative to specialized safety models. A single model trained only on safe English text transfers effectively to diverse domains and over 14 languages without retraining. Finally, we demonstrate production readiness by integrating a GPU-optimized version into vLLM, enabling continuous guardrailing during token generation with less than 6\% overhead even under dense evaluation intervals on large-scale workloads.
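To make the core idea concrete, below is a minimal sketch of an OOD-based prompt filter in the spirit described by the abstract: embed prompts in a semantic space, model the distribution of safe prompts only, and flag large deviations. This is not the authors' implementation; the sentence encoder, the single-Gaussian/Mahalanobis scoring, and the quantile-based threshold are all illustrative assumptions.

```python
# Illustrative sketch of "trust the typical": fit a distribution over SAFE prompt
# embeddings only, then reject prompts whose Mahalanobis distance is anomalously large.
# Assumptions (not from the paper): sentence-transformers encoder, Gaussian model,
# threshold calibrated as a quantile of the safe-prompt scores.
import numpy as np
from sentence_transformers import SentenceTransformer


class SafePromptOODDetector:
    def __init__(self, model_name: str = "all-MiniLM-L6-v2"):
        self.encoder = SentenceTransformer(model_name)  # embedding model is an assumption
        self.mean = None
        self.precision = None
        self.threshold = None

    def fit(self, safe_prompts: list[str], quantile: float = 0.99) -> None:
        """Fit the 'safe' distribution from benign prompts only (no harmful examples)."""
        X = self.encoder.encode(safe_prompts, convert_to_numpy=True)
        self.mean = X.mean(axis=0)
        # Regularize the covariance so the precision matrix is well defined.
        cov = np.cov(X, rowvar=False) + 1e-6 * np.eye(X.shape[1])
        self.precision = np.linalg.inv(cov)
        # Calibrate the rejection threshold on the safe-prompt scores themselves.
        scores = self._mahalanobis(X)
        self.threshold = float(np.quantile(scores, quantile))

    def _mahalanobis(self, X: np.ndarray) -> np.ndarray:
        diff = X - self.mean
        return np.sqrt(np.einsum("ij,jk,ik->i", diff, self.precision, diff))

    def is_safe(self, prompt: str) -> bool:
        """Accept a prompt iff it stays close to the learned 'typical' distribution."""
        x = self.encoder.encode([prompt], convert_to_numpy=True)
        return bool(self._mahalanobis(x)[0] <= self.threshold)


# Usage: fit on a (realistically much larger) corpus of safe prompts, then screen inputs.
detector = SafePromptOODDetector()
detector.fit(["How do I bake sourdough bread?", "Summarize this article for me."])
print(detector.is_safe("Explain photosynthesis to a child."))
```

In practice the safe-prompt corpus would need to be large and diverse for the covariance estimate to be meaningful; the two-example fit above only demonstrates the API shape of the sketch.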
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 22911