Keywords: LLM safety, jailbreak defense
TL;DR: a robust and efficient intention-extraction method for LLM jailbreak defense
Abstract: Large Language Models (LLMs) are vulnerable to jailbreak attacks even with safety
alignment. Existing defenses typically lack precise localization of harmful intent,
leading to ineffective defense against complex jailbreak prompts. For precise
localization, we exploit 'semantic consistency' between an input-output pair:
regardless of how complex the jailbreak input is, the output always responds
to the actual input intent. In this paper, we present SENTINEL, a
plug-and-play module that fits into the auto-regressive generation process
of any model and systematically exploits semantic consistency to extract the intent
behind jailbreaks. Specifically, during generation we solve an optimization
problem to extract semantically aligned sub-sequences from an input-output pair, then
efficiently quantify their harmfulness via the projection onto the refusal
direction, and decide whether to halt the generation process as the defense.
Experiments demonstrate that SENTINEL significantly reduces attack success
rates, mostly to below 5%, on various jailbreaks across all evaluated LLMs; we
also explain the underlying mechanism as redistributing jailbreak features from
alignment blind spots to aligned regions.
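The core scoring step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the threshold value, and the assumption that a precomputed "refusal direction" vector is available are all hypothetical.

```python
import numpy as np

def refusal_projection(hidden_state: np.ndarray, refusal_dir: np.ndarray) -> float:
    """Scalar projection of a hidden state onto the unit refusal direction.

    A larger value indicates the state lies further along the direction
    associated with refusals, which is used here as a harmfulness proxy.
    """
    unit = refusal_dir / np.linalg.norm(refusal_dir)
    return float(hidden_state @ unit)

def should_halt(hidden_state: np.ndarray, refusal_dir: np.ndarray,
                threshold: float = 0.5) -> bool:
    # Halt generation when the projection exceeds a tuned threshold
    # (threshold value is an illustrative assumption).
    return refusal_projection(hidden_state, refusal_dir) > threshold
```

In practice the hidden state would come from an intermediate layer of the model during auto-regressive decoding, and the refusal direction would be estimated offline, e.g. from contrastive pairs of harmful and harmless prompts.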
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 10779