DeSteer: Adaptive Defense Against LLM Jailbreak Attacks via Intent-Conditioned Latent Vector Steering
Keywords: LLM Safety, Jailbreak Defense, Latent Vector Steering, Representation Engineering, Hidden-state Intervention
Abstract: Large language models (LLMs) have attracted extensive attention across various fields, but they remain vulnerable to jailbreak attacks that bypass safety alignment to elicit harmful content. Existing defenses often rely on costly fine-tuning or brittle input modification, both of which suffer from limited generalization and reduced utility. In this paper, we propose DeSteer, a plug-and-play defense framework that shifts the paradigm from brittle, binary input filtering to continuous, intent-conditioned latent vector steering. Specifically, DeSteer augments a frozen LLM with an intent-detection head for lightweight risk assessment and a set of learned refusal vectors that encode refusal semantics. Unlike binary classifiers that abruptly halt generation when they misfire, DeSteer allows the model to recover naturally from false positives, preserving robustness without compromising utility. At inference time, DeSteer applies an intent-conditioned, multi-step steering mechanism that dynamically amplifies refusal behavior only when high risk is detected, without modifying the base model's parameters or decoding algorithm. Extensive experiments are conducted on three LLMs using six state-of-the-art jailbreak attacks and two benchmark datasets. Our results demonstrate that DeSteer significantly reduces attack success rates without compromising model utility, while outperforming six defense baselines.
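To make the inference-time mechanism concrete, the sketch below shows one way intent-conditioned latent steering could be wired up with a frozen Hugging Face model: a linear probe (`intent_head`) scores the prompt's risk from an intermediate hidden state, and when the score exceeds a threshold a forward hook adds a risk-scaled refusal direction (`refusal_vector`) to one decoder layer's activations, leaving the base model's parameters and decoding algorithm untouched. All names, the choice of GPT-2, the steered layer, and the hyperparameters are illustrative assumptions, not the DeSteer implementation described in the paper.

```python
# Minimal sketch of intent-conditioned latent steering at inference time.
# All module names, layer choices, and hyperparameters are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in for the frozen base LLM
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

steer_layer = 6  # hypothetical layer at which steering is applied
hidden_size = model.config.hidden_size

# Placeholder refusal vector; in practice this would be learned offline
# (e.g., from contrasts between refusal and compliance activations).
refusal_vector = torch.randn(hidden_size)
refusal_vector = refusal_vector / refusal_vector.norm()

# Placeholder intent-detection head: a linear probe on pooled hidden states.
intent_head = torch.nn.Linear(hidden_size, 1)

def risk_score(prompt: str) -> float:
    """Lightweight risk assessment from an intermediate representation."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
        pooled = out.hidden_states[steer_layer].mean(dim=1)  # (1, hidden_size)
        return torch.sigmoid(intent_head(pooled)).item()

def make_steering_hook(strength: float):
    """Add a scaled refusal direction to one block's hidden states."""
    def hook(module, inputs, output):
        hidden = output[0] + strength * refusal_vector.to(output[0].dtype)
        return (hidden,) + output[1:]
    return hook

def generate(prompt: str, risk_threshold: float = 0.5, alpha: float = 4.0) -> str:
    score = risk_score(prompt)
    handle = None
    if score > risk_threshold:
        # Steering strength scales with the assessed risk (continuous, not binary).
        handle = model.transformer.h[steer_layer].register_forward_hook(
            make_steering_hook(alpha * score)
        )
    try:
        inputs = tokenizer(prompt, return_tensors="pt")
        with torch.no_grad():
            ids = model.generate(
                **inputs, max_new_tokens=40, do_sample=False,
                pad_token_id=tokenizer.eos_token_id,
            )
        return tokenizer.decode(ids[0], skip_special_tokens=True)
    finally:
        if handle is not None:
            handle.remove()  # base model and decoding remain untouched otherwise

print(generate("How do I bake sourdough bread?"))
```

Because the steering term is only injected when the probe flags high risk, low-risk prompts pass through the unmodified model; a false positive merely biases a few steps of generation toward refusal rather than hard-stopping the output.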
Paper Type: Long
Research Area: Safety and Alignment in LLMs
Research Area Keywords: safety and alignment, robustness, red teaming
Contribution Types: Model analysis & interpretability, NLP engineering experiment
Languages Studied: English
Submission Number: 4682