Keywords: Large Language Models, Jailbreak Detection, Model Safety, Linear Probing, Latent Representations, Hidden States
TL;DR: Latent Sentinel is a lightweight, real-time jailbreak detection framework that uses linear probes on frozen LLM hidden layers to accurately identify adversarial prompts with minimal computational overhead.
Abstract: We present Latent Sentinel, a lightweight and architecture-portable framework for online screening of jailbreak prompts in large language models (LLMs). Under a pre-generation input-filtering threat model, we attach tiny linear probes to the frozen hidden states of multiple Transformer layers and aggregate their scores in real time. Without modifying base weights, Latent Sentinel adds <0.003% parameters and incurs ≈0.1–0.13% latency overhead on a single A100. We train on 50k adversarial prompts (25k jailbreak + 25k red-teaming) from JailbreakV-28k and 50k benign prompts from Alpaca (≈90/10 split), and evaluate on JailbreakBench, AdvBench, and MultiJail spanning 17 categories, five attack families (SR/MR/PE/AS/DAN), and 10 languages (EN/ZH/IT/VI/AR/KO/TH/BN/SW/JV). On Qwen2.5-7B-Instruct and Llama-3.1-8B-Instruct, Latent Sentinel achieves 98–100% detection on JailbreakBench/AdvBench and maintains high cross-lingual accuracy; performance remains strong on alignment-degraded variants produced via shadow-alignment SFT. Ablations show that layer-wise coverage and cross-layer aggregation are critical, and threshold calibration improves specificity on benign inputs. We also observe reduced specificity for some out-of-distribution benign prompts, underscoring the need for deployment-time calibration. Overall, the results suggest that adversarial intent is approximately linearly separable in LLM latent space and establish layer-wise linear probing as a practical, real-time defense primitive for trustworthy LLM deployment.
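The abstract describes attaching tiny linear probes to frozen hidden states of several Transformer layers and aggregating their scores. A minimal NumPy sketch of that idea is below; the class name `LatentProbe`, the mean-probability aggregation rule, and all dimensions are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class LatentProbe:
    """Hypothetical sketch: one tiny linear head (w, b) per monitored
    layer; per-layer probabilities are averaged into one jailbreak score.
    The aggregation rule (mean) is an assumption for illustration."""

    def __init__(self, hidden_size, num_layers):
        # per-layer heads add only hidden_size + 1 parameters each,
        # consistent with the "<0.003% parameters" framing
        self.weights = rng.normal(size=(num_layers, hidden_size)) * 0.01
        self.biases = np.zeros(num_layers)

    def score(self, hidden_states):
        # hidden_states: (num_layers, batch, hidden_size) pooled
        # activations from the frozen base model
        logits = (
            np.einsum("lbh,lh->lb", hidden_states, self.weights)
            + self.biases[:, None]
        )
        # cross-layer aggregation: mean of per-layer probabilities
        return sigmoid(logits).mean(axis=0)  # (batch,)

# toy usage: random activations stand in for real frozen hidden states
probe = LatentProbe(hidden_size=64, num_layers=4)
states = rng.normal(size=(4, 2, 64))
scores = probe.score(states)  # two scores, each in (0, 1)
```

In deployment, each score would be compared against a calibrated threshold, matching the abstract's note that threshold calibration improves specificity on benign inputs.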
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 7828