Keywords: Large Language Models (LLMs), Safety in AI, Adversarial Robustness, Attention Heads, Refusal Mechanisms, Model Editing, Mechanistic Interpretability
TL;DR: We propose DRefA (Detection–Refusal Advanced LLM), a method that edits detection and refusal heads in LLMs to improve safety against adversarial jailbreaks.
Abstract: Ensuring the safety of large language models (LLMs) is crucial as they become increasingly integrated into real-world applications. Despite advances in training and fine-tuning techniques, LLMs remain vulnerable to generating harmful or unsafe content, especially under adversarial prompts. In this work, we investigate the internal attention mechanisms that detect harmful content and drive refusal behaviors in LLMs. We develop systematic methods to identify $\textit{detection heads}$, which are highly sensitive to harmful prompts, and $\textit{refusal heads}$, which contribute to the model’s tendency to reject unsafe requests. Building on these insights, we introduce the $\textbf{Detection–Refusal Advanced LLM (DRefA)}$, an enhanced model in which detection and refusal heads are scaled to improve safety. Safety is quantified as the proportion of responses judged safe by $\texttt{Llama-Guard-3-8B}$, which we refer to as the $\textit{safety rate}$. DRefA achieves substantial robustness gains: for instance, the safety rate of LLaMA3 increases from 77% to 99% under GCG attacks and from 15% to 99% under ADV-LLM attacks. Our findings provide mechanistic insights into the structural components of LLM safety and offer practical interventions to mitigate harmful outputs, contributing to the development of more trustworthy AI systems.
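As an illustrative sketch only (the notation, the per-head factor $\alpha_{l,h}$, and the set $\mathcal{H}$ of identified detection/refusal heads are our assumptions, not details stated in the submission), the head-scaling intervention and the safety-rate metric described above can be written as:

$$\mathrm{MHA}^{(l)}(x) \;=\; \sum_{h=1}^{H} \alpha_{l,h}\, \mathrm{head}^{(l)}_{h}(x)\, W_O^{(l,h)}, \qquad \alpha_{l,h} = \begin{cases} \alpha > 1 & \text{if } (l,h) \in \mathcal{H}, \\ 1 & \text{otherwise,} \end{cases}$$

$$\textit{safety rate} \;=\; \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}\!\left[\, \texttt{Llama-Guard-3-8B}(y_i) = \text{safe} \,\right],$$

where $W_O^{(l,h)}$ is the output-projection block for head $h$ at layer $l$, and $y_1, \dots, y_N$ are the model responses to $N$ (adversarial) prompts.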
Submission Number: 17