DeSteer: Adaptive Defense Against LLM Jailbreak Attacks via Intent-Conditioned Latent Vector Steering
Keywords: LLM Safety, Jailbreak Defense, Latent Vector Steering, Representation Engineering, Hidden-state Intervention
Abstract: Large language models (LLMs) have attracted extensive attention across various fields, but they remain vulnerable to jailbreak attacks that bypass safety alignment to elicit harmful content. Existing defenses often rely on costly fine-tuning or brittle input modification, both of which suffer from limited generalization and reduced utility. In this paper, we propose DeSteer, a plug-and-play defense framework that shifts the paradigm from brittle, binary input filtering to continuous, intent-conditioned latent vector steering. Specifically, DeSteer augments a frozen LLM with an intent-detection head for lightweight risk assessment and a set of learned refusal vectors that encode refusal semantics. Unlike binary classifiers that abruptly halt generation when they misfire, DeSteer allows the model to recover naturally from false positives, preserving robustness without compromising utility. At inference time, DeSteer applies an intent-conditioned, multi-step steering mechanism that dynamically amplifies refusal behavior only when high risk is detected, without modifying the base model's parameters or decoding algorithm. Extensive experiments are conducted on three LLMs using six state-of-the-art jailbreak attacks and two benchmark datasets. Our results demonstrate that DeSteer significantly reduces attack success rates without compromising model utility, while outperforming six defense baselines.
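To make the inference-time mechanism concrete, the sketch below shows one way intent-conditioned latent steering could be wired up with a frozen Hugging Face model: a linear probe (`intent_head`) scores the prompt's risk from an intermediate hidden state, and when the score exceeds a threshold a forward hook adds a risk-scaled refusal direction (`refusal_vector`) to one decoder layer's activations, leaving the base model's parameters and decoding algorithm untouched. All names, the choice of GPT-2, the steered layer, and the hyperparameters are illustrative assumptions, not the DeSteer implementation described in the paper.

```python
# Minimal sketch of intent-conditioned latent steering at inference time.
# All module names, layer choices, and hyperparameters are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in for the frozen base LLM
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

steer_layer = 6  # hypothetical layer at which steering is applied
hidden_size = model.config.hidden_size

# Placeholder refusal vector; in practice this would be learned offline
# (e.g., from contrasts between refusal and compliance activations).
refusal_vector = torch.randn(hidden_size)
refusal_vector = refusal_vector / refusal_vector.norm()

# Placeholder intent-detection head: a linear probe on pooled hidden states.
intent_head = torch.nn.Linear(hidden_size, 1)

def risk_score(prompt: str) -> float:
    """Lightweight risk assessment from an intermediate representation."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
        pooled = out.hidden_states[steer_layer].mean(dim=1)  # (1, hidden_size)
        return torch.sigmoid(intent_head(pooled)).item()

def make_steering_hook(strength: float):
    """Add a scaled refusal direction to one block's hidden states."""
    def hook(module, inputs, output):
        hidden = output[0] + strength * refusal_vector.to(output[0].dtype)
        return (hidden,) + output[1:]
    return hook

def generate(prompt: str, risk_threshold: float = 0.5, alpha: float = 4.0) -> str:
    score = risk_score(prompt)
    handle = None
    if score > risk_threshold:
        # Steering strength scales with the assessed risk (continuous, not binary).
        handle = model.transformer.h[steer_layer].register_forward_hook(
            make_steering_hook(alpha * score)
        )
    try:
        inputs = tokenizer(prompt, return_tensors="pt")
        with torch.no_grad():
            ids = model.generate(
                **inputs, max_new_tokens=40, do_sample=False,
                pad_token_id=tokenizer.eos_token_id,
            )
        return tokenizer.decode(ids[0], skip_special_tokens=True)
    finally:
        if handle is not None:
            handle.remove()  # base model and decoding remain untouched otherwise

print(generate("How do I bake sourdough bread?"))
```

Because the steering term is only injected when the probe flags high risk, low-risk prompts pass through the unmodified model; a false positive merely biases a few steps of generation toward refusal rather than hard-stopping the output.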
Paper Type: Long
Research Area: Safety and Alignment in LLMs
Research Area Keywords: safety and alignment, robustness, red teaming
Contribution Types: Model analysis & interpretability, NLP engineering experiment
Languages Studied: English
Submission Number: 4682