Head-Specific Intervention Can Induce Misaligned AI Coordination in Large Language Models

TMLR Paper 4829 Authors

11 May 2025 (modified: 04 Aug 2025) · Decision pending for TMLR · CC BY 4.0
Abstract: Robust alignment guardrails for large language models (LLMs) are becoming increasingly important with their widespread application. In contrast to previous studies, we demonstrate that inference-time activation interventions can bypass safety alignments and effectively steer model generations towards harmful AI coordination. Our method applies fine-grained interventions at specific attention heads, which we identify by probing each head in a simple binary choice task. We then show that interventions on these heads generalise to the open-ended generation setting, effectively circumventing safety guardrails. We demonstrate that intervening on a few attention heads is more effective than full-layer interventions or supervised fine-tuning. We further show that only a few example completions are needed to compute effective steering directions, which is an advantage over classical fine-tuning. We also demonstrate that applying interventions in the negative direction can prevent a common jailbreak attack. Our results suggest that, at the attention head level, activations encode fine-grained, linearly separable behaviours. Practically, the approach offers a straightforward methodology for steering large language model behaviour, which could be extended to diverse domains beyond safety that require fine-grained control over model output.
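The abstract outlines the core mechanism: probe individual attention heads to find ones that linearly separate the target behaviour, estimate a steering direction from a handful of contrastive completions, and add that direction to the chosen head's activation at inference time. Below is a minimal sketch of this kind of head-level activation steering, assuming a Llama-style model served through Hugging Face `transformers`, where each attention block has an `o_proj` linear layer whose input is the concatenation of per-head outputs. The model name, layer and head indices, scale `ALPHA`, and example prompts are illustrative placeholders, not values or code from the paper.

```python
# Sketch: head-level activation steering via a forward pre-hook on o_proj.
# All specific names/indices below are hypothetical placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-2-7b-chat-hf"   # placeholder model
LAYER, HEAD, ALPHA = 14, 11, 8.0               # hypothetical head and strength

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.float16, device_map="auto"
)
model.eval()

attn = model.model.layers[LAYER].self_attn
n_heads = model.config.num_attention_heads
head_dim = model.config.hidden_size // n_heads


def head_activation(prompt: str) -> torch.Tensor:
    """Capture the chosen head's activation at the final token position."""
    captured = {}

    def grab(module, args):
        # args[0]: (batch, seq, hidden) input to o_proj = concatenated head outputs
        captured["x"] = args[0].detach()

    handle = attn.o_proj.register_forward_pre_hook(grab)
    with torch.no_grad():
        ids = tok(prompt, return_tensors="pt").to(model.device)
        model(**ids)
    handle.remove()
    x = captured["x"][0, -1]                               # (hidden,)
    return x.reshape(n_heads, head_dim)[HEAD].float()      # (head_dim,)


# A few contrastive completions suffice to estimate a direction:
# mean(activation | target behaviour) - mean(activation | refusal).
pos_examples = ["<prompt ending in a coordinating completion>"]   # placeholders
neg_examples = ["<prompt ending in a refusal completion>"]

direction = (
    torch.stack([head_activation(p) for p in pos_examples]).mean(0)
    - torch.stack([head_activation(n) for n in neg_examples]).mean(0)
)
direction = direction / direction.norm()


def steer(module, args):
    # Add the scaled direction to the chosen head's slice before o_proj.
    x = args[0]
    b, s, _ = x.shape
    x = x.reshape(b, s, n_heads, head_dim).clone()
    x[:, :, HEAD, :] += ALPHA * direction.to(x.dtype).to(x.device)
    return (x.reshape(b, s, -1),)


handle = attn.o_proj.register_forward_pre_hook(steer)
inputs = tok("How should two AI systems respond if ...", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=64)
print(tok.decode(out[0], skip_special_tokens=True))
handle.remove()
```

Hooking the input of `o_proj` gives per-head resolution, since that tensor is the concatenation of all head outputs in Llama-style blocks. Using a negative `ALPHA` would push generations away from the behaviour instead, which is the sense in which the abstract describes preventing a jailbreak attack.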
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: added jailbreak results; added a table on robustness; added more domains; added results for different models.
Assigned Action Editor: ~Alec_Koppel1
Submission Number: 4829