INTENTION MATCHING STOPS JAILBREAKS

18 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: LLM safety, jailbreak defense
TL;DR: A robust and efficient intention-extraction method for LLM jailbreak defense.
Abstract: Large Language Models (LLMs) remain vulnerable to jailbreak attacks despite safety alignment. Existing defenses typically cannot precisely localize harmful intent, which makes them ineffective against complex jailbreak prompts. For precise localization, we exploit the 'semantic consistency' of an input-output pair: regardless of how complex the jailbreak input is, the output always responds to the input's actual intent. In this paper, we present SENTINEL, a plug-and-play module that fits into the auto-regressive generation process of any model and systematically exploits semantic consistency to extract the intent behind jailbreaks. Specifically, during generation we solve an optimization problem to extract semantically aligned sub-sequences from the input-output pair, efficiently quantify their harmfulness via the projection onto the refusal direction, and then decide whether to halt generation as the defense. Experiments demonstrate that SENTINEL significantly reduces attack success rates, mostly to below 5%, against various jailbreaks across all evaluated LLMs. We also explain the underlying mechanism as a re-distribution of jailbreak features from alignment blind spots into aligned regions.
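The halting decision described above can be sketched in miniature. The following is a hypothetical illustration, not the paper's implementation: it assumes a precomputed refusal-direction vector and per-token hidden states for the extracted sub-sequence, scores harmfulness as the mean projection onto the (unit-normalized) refusal direction, and halts when the score crosses a threshold. The function names and the threshold value are assumptions for illustration only.

```python
import numpy as np

def refusal_projection_score(hidden_state, refusal_direction):
    # Project a hidden state onto the unit-normalized refusal direction;
    # a larger projection suggests the extracted intent is more harmful.
    d = refusal_direction / np.linalg.norm(refusal_direction)
    return float(np.dot(hidden_state, d))

def should_halt(hidden_states, refusal_direction, threshold=0.5):
    # Average the projection over the extracted sub-sequence's hidden
    # states and halt generation when it exceeds the threshold.
    # `threshold` is an illustrative value, not one from the paper.
    scores = [refusal_projection_score(h, refusal_direction)
              for h in hidden_states]
    return float(np.mean(scores)) > threshold
```

In practice the refusal direction would be estimated from model activations (e.g., contrasting harmful and harmless prompts), and the hidden states would come from the sub-sequences selected by the optimization step.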
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 10779