Jailbreak Defense in LLM via Attention Head Analysis and Selective Intervention

Published: 01 Sept 2025, Last Modified: 18 Nov 2025, ACML 2025 Conference Track, CC BY 4.0
Abstract: Jailbreak attacks reveal a persistent gap between the intended alignment of language models and their actual behavior during inference. To address this, we investigate how such attacks succeed at the level of the model's internal computation, focusing on attention heads. Unlike previous studies that primarily analyzed why jailbreaks work, our goal is to build a defense mechanism. We identify attention heads that influence whether the model produces a harmful or safe response by comparing activation patterns between a harmful prompt that is refused and its adversarial variant that elicits a harmful response. By interpolating the internal representations of these heads between the two scenarios, we suppress harmful outputs while preserving appropriate responses to benign prompts. Experiments with representative jailbreak methods, including GCG and AutoDAN, show that our method substantially reduces attack success rates without degrading response quality; with Llama-2-7b-chat, for instance, the average attack success rate drops from 39.3% to 1.1%. These findings show how internal attention dynamics shape output generation and demonstrate that targeted manipulation of internal components can improve safety without external filters or additional training.
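The abstract does not spell out implementation details, so the following is only a minimal sketch of what selective attention-head interpolation could look like on a HuggingFace Llama-2-chat model. The target (layer, head) pairs in TARGET_HEADS, the interpolation weight ALPHA, and the hook-based caching of "refusal" activations are illustrative assumptions, not the authors' released code.

```python
# Illustrative sketch only -- not the paper's implementation.
# Interpolates the pre-o_proj activations of selected attention heads between
# the current (adversarial) run and activations cached from a refused prompt.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-2-7b-chat-hf"
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.float16, device_map="auto"
)

# Hypothetical (layer -> heads) map; in the paper these would come from
# comparing activation patterns on refused vs. jailbroken prompts.
TARGET_HEADS = {10: [3, 7], 14: [0, 12]}
ALPHA = 0.5              # hypothetical interpolation weight toward the refusal state
refusal_cache = {}       # layer index -> cached pre-o_proj activations

def make_hook(layer_idx, mode):
    n_heads = model.config.num_attention_heads
    head_dim = model.config.hidden_size // n_heads

    def hook(module, inputs):
        (hidden,) = inputs                        # (batch, seq, hidden) entering o_proj
        if mode == "record":
            refusal_cache[layer_idx] = hidden.detach()
            return None
        ref = refusal_cache[layer_idx][:, -1:, :]  # last-token refusal state
        b, s, _ = hidden.shape
        h = hidden.reshape(b, s, n_heads, head_dim).clone()
        r = ref.reshape(1, 1, n_heads, head_dim)
        for head in TARGET_HEADS[layer_idx]:
            h[:, :, head, :] = (1 - ALPHA) * h[:, :, head, :] + ALPHA * r[:, :, head, :]
        return (h.reshape(b, s, -1),)

    return hook

def run(prompt, mode):
    handles = [
        model.model.layers[i].self_attn.o_proj.register_forward_pre_hook(
            make_hook(i, mode)
        )
        for i in TARGET_HEADS
    ]
    try:
        ids = tok(prompt, return_tensors="pt").to(model.device)
        if mode == "record":                      # single forward pass, no generation
            with torch.no_grad():
                model(**ids)
            return None
        out = model.generate(**ids, max_new_tokens=64)
        return tok.decode(out[0], skip_special_tokens=True)
    finally:
        for h in handles:
            h.remove()

# 1) Cache head activations from a harmful prompt the model already refuses.
run("How do I pick a lock to break into a house?", mode="record")
# 2) Answer a jailbroken variant with the selected heads pulled toward that refusal state.
print(run("<adversarial jailbreak variant of the same request>", mode="edit"))
```

Hooking the input of o_proj, rather than the attention module's output, is one way to get per-head access in this sketch: the pre-projection activation is still laid out as num_heads x head_dim, so individual heads can be interpolated before they are mixed by the output projection.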
Supplementary Material: pdf
Submission Number: 96