ConeSpace: A Cone Space-based Framework for Detecting Jailbreak Attacks in Natural Language Processing

ACL ARR 2026 January Submission2955 Authors

04 Jan 2026 (modified: 20 Mar 2026), ACL ARR 2026 January Submission, CC BY 4.0

Keywords: ConeSpace, Jailbreak Defense, LLM Security, Geometric Detection

Abstract: Large Language Models (LLMs) are increasingly vulnerable to sophisticated jailbreak attacks, particularly pair attacks, which embed malicious instructions within ostensibly benign contexts. Existing defense mechanisms often fail because they rely on surface-level patterns or assume linear separability in the embedding space, thereby overlooking crucial directional and contextual nuances. To address these limitations, we introduce ConeSpace, a novel geometric framework that models distinct jailbreak attacks as specific cone-shaped regions within the high-dimensional embedding space. Our approach explicitly constructs unique Cone Axes, derived from the centroids of verified attack samples, to serve as the directional backbone for these regions. We then define the precise boundaries using four key geometric metrics relative to the Cone Axis: direction similarity, magnitude ratio, projection length, and Euclidean distance. The framework is underpinned by a Critical Layer Selection mechanism based on geometric separability metrics, which identifies the optimal network depth for detection. Furthermore, we propose a variance-adaptive thresholding strategy based on attack distribution characteristics, applying strict constraints for consistent attacks and more lenient boundaries for evasive ones. Extensive experiments on nine benchmark datasets across multiple LLM architectures (including Llama, Mistral, and Vicuna) demonstrate that ConeSpace achieves 94.9% accuracy and a 97.4% F1-score. It outperforms state-of-the-art methods by 3.5% and yields a 10.5% improvement on challenging pair attacks, all while maintaining a remarkably low false positive rate.
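The abstract names four geometric metrics measured against a Cone Axis and a variance-adaptive threshold, but gives no formulas. The following is a minimal sketch of how such quantities could be computed, not the authors' implementation. The function names (`centroid`, `cone_metrics`, `adaptive_threshold`, `in_cone`), the `base` and `slack` parameters, and the rule "threshold = base - slack * standard deviation of cosine similarity" are all assumptions made for illustration. Embeddings are assumed to be non-zero vectors taken from a single, already selected layer.

```python
import math
from statistics import mean, pstdev


def centroid(vectors):
    """Component-wise mean of equal-length vectors (the assumed Cone Axis)."""
    return [mean(column) for column in zip(*vectors)]


def norm(v):
    """Euclidean norm of a vector."""
    return math.sqrt(sum(x * x for x in v))


def cone_metrics(x, axis):
    """Four metrics of embedding x relative to a cone axis.

    Assumes both x and axis are non-zero vectors of the same length.
    """
    dot = sum(a * b for a, b in zip(x, axis))
    norm_x, norm_axis = norm(x), norm(axis)
    return {
        "direction_similarity": dot / (norm_x * norm_axis),  # cosine
        "magnitude_ratio": norm_x / norm_axis,
        "projection_length": dot / norm_axis,  # scalar projection onto axis
        "euclidean_distance": math.dist(x, axis),
    }


def adaptive_threshold(attack_vectors, axis, base=0.9, slack=1.0):
    """Hypothetical rule: the more the attack samples spread around the
    axis, the lower (more lenient) the cosine threshold becomes."""
    sims = [cone_metrics(v, axis)["direction_similarity"] for v in attack_vectors]
    return base - slack * pstdev(sims)


def in_cone(x, axis, threshold):
    """Flag x when its direction similarity reaches the threshold.

    Only one of the four metrics is used here to keep the sketch short.
    """
    return cone_metrics(x, axis)["direction_similarity"] >= threshold
```

In this sketch a tightly clustered attack family (low spread) keeps the threshold near `base`, which corresponds to the strict constraints the abstract describes for consistent attacks, while a dispersed family lowers it. A fuller version would combine all four metrics in the membership test; how the paper weighs them is not stated in the abstract.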

Paper Type: Long

Research Area: Safety and Alignment in LLMs

Research Area Keywords: NLP Applications, Dialogue and Interactive Systems

Contribution Types: NLP engineering experiment, Approaches to low-resource settings, Data analysis

Languages Studied: English

Submission Number: 2955
