Keywords: Large Vision Language Model, Jailbreak Detection
Abstract: Large Vision-Language Models (LVLMs) are vulnerable to jailbreak attacks that can elicit harmful content. Existing detection methods are either limited to specific attack types or too time-consuming, making them impractical for real-world deployment. To address these challenges, we propose \textbf{JDJN} (\textbf{J}ailbreak \textbf{D}etection via \textbf{J}ail\textbf{N}eurons), a novel jailbreak detection method for LVLMs. Specifically, we focus on \textbf{JailNeurons}, the key neurons associated with jailbreak behavior at each layer of the model. Unlike ``SafeNeurons'', which explain why aligned models can reject ordinary harmful queries, JailNeurons capture how jailbreak prompts circumvent safety mechanisms. They provide an important and previously underexplored complement to existing safety research. We design a neuron localization algorithm to identify these JailNeurons and then aggregate them across layers to train a generalizable detector. Experimental results demonstrate that our method effectively extracts jailbreak-related information from high-dimensional hidden states. As a result, our approach achieves the highest detection success rate with exceptionally low false positive rates. Furthermore, the detector exhibits strong generalizability, maintaining high detection success rates across unseen benign datasets and attack types. Finally, our method is computationally efficient, with low training costs and fast inference speeds, highlighting its potential for real-world deployment.
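Below is a minimal, hypothetical sketch of a JailNeuron-style pipeline, not the authors' implementation: the abstract does not specify the localization rule or detector, so the scoring criterion (mean activation difference between jailbreak and benign prompts), the per-layer top-k selection, the logistic-regression detector, and all names and shapes (`locate_jailneurons`, `aggregate_features`, `n_layers`, `hidden_dim`) are illustrative assumptions.

```python
# Hypothetical JailNeuron-style detector sketch (illustrative only, not the paper's method).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_layers, hidden_dim, top_k = 4, 256, 16

# Placeholder hidden-state activations, shape [n_samples, n_layers, hidden_dim];
# in practice these would come from the LVLM on jailbreak vs. benign prompts.
jail_acts = rng.normal(0.5, 1.0, size=(200, n_layers, hidden_dim))
benign_acts = rng.normal(0.0, 1.0, size=(200, n_layers, hidden_dim))

def locate_jailneurons(jail, benign, k):
    """Per layer, select the k neurons whose mean activation differs most
    between jailbreak and benign prompts (one plausible localization rule)."""
    diff = np.abs(jail.mean(axis=0) - benign.mean(axis=0))  # [n_layers, hidden_dim]
    return [np.argsort(diff[layer])[-k:] for layer in range(diff.shape[0])]

def aggregate_features(acts, neuron_idx):
    """Concatenate the selected neurons' activations across layers into one feature vector."""
    return np.concatenate(
        [acts[:, layer, idx] for layer, idx in enumerate(neuron_idx)], axis=1
    )

neuron_idx = locate_jailneurons(jail_acts, benign_acts, top_k)
X = np.vstack([aggregate_features(jail_acts, neuron_idx),
               aggregate_features(benign_acts, neuron_idx)])
y = np.concatenate([np.ones(len(jail_acts)), np.zeros(len(benign_acts))])

# A lightweight classifier on the aggregated JailNeuron activations.
detector = LogisticRegression(max_iter=1000).fit(X, y)
print("training accuracy:", detector.score(X, y))
```

The key design point this sketch reflects is dimensionality reduction: rather than feeding full high-dimensional hidden states to a detector, only a small set of jailbreak-relevant neurons per layer is kept, which keeps training cheap and inference fast.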
Supplementary Material: zip
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 9442