Head-Specific Intervention Can Induce Misaligned AI Coordination in Large Language Models

Paul Darm; Annalisa Riccardi

Published: 24 Aug 2025, Last Modified: 24 Aug 2025. Accepted by TMLR. License: CC BY 4.0

Abstract: Robust alignment guardrails for large language models (LLMs) are becoming increasingly important with their widespread application. In contrast to previous studies, we demonstrate that inference-time activation interventions can bypass safety alignments and effectively steer model generations towards harmful AI coordination. Our method applies fine-grained interventions at specific attention heads, which we identify by probing each head in a simple binary choice task. We then show that interventions on these heads generalise to the open-ended generation setting, effectively circumventing safety guardrails. We demonstrate that intervening on a few attention heads is more effective than intervening on full layers or applying supervised fine-tuning. We further show that only a few example completions are needed to compute effective steering directions, which is an advantage over classical fine-tuning. We also demonstrate that applying interventions in the negative direction can prevent a common jailbreak attack. Our results suggest that, at the attention head level, activations encode fine-grained, linearly separable behaviours. Practically, the approach offers a straightforward methodology for steering large language model behaviour, which could be extended to diverse domains beyond safety that require fine-grained control over the model output.
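The head-level intervention described in the abstract can be illustrated with a minimal sketch in plain Python. This is not the authors' implementation: the mean-difference construction of the steering direction, the scaling factor `alpha`, and the helper names `steering_direction` and `intervene_on_head` are all assumptions made here for illustration, and the activations are toy numbers rather than values taken from a model.

```python
# Illustrative sketch only. The mean-difference direction, the scaling
# factor `alpha`, and all function names are assumptions, not details
# taken from the paper or its code repository.

def mean_vector(rows):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def steering_direction(pos_acts, neg_acts):
    """Unit-norm difference between the mean activations of two example sets."""
    mu_pos = mean_vector(pos_acts)
    mu_neg = mean_vector(neg_acts)
    diff = [p - q for p, q in zip(mu_pos, mu_neg)]
    norm = sum(d * d for d in diff) ** 0.5
    if norm == 0.0:
        raise ValueError("example sets have identical mean activations")
    return [d / norm for d in diff]

def intervene_on_head(layer_out, head, head_dim, direction, alpha):
    """Add alpha * direction to one head's slice of a concatenated head output."""
    start = head * head_dim
    steered = list(layer_out)  # copy so the input is left unchanged
    for i, d in enumerate(direction):
        steered[start + i] += alpha * d
    return steered

# Toy activations for a single head of dimension 2.
pos = [[1.0, 0.0], [3.0, 0.0]]  # activations on examples showing the behaviour
neg = [[0.0, 0.0], [0.0, 0.0]]  # activations on examples not showing it
direction = steering_direction(pos, neg)

# Concatenated output of 2 heads (head_dim = 2); only head 1 is modified.
out = intervene_on_head([0.5, 0.5, 0.5, 0.5], head=1, head_dim=2,
                        direction=direction, alpha=2.0)
```

In this sketch a positive `alpha` pushes the selected head's activation along the direction and a negative `alpha` pushes it the opposite way, which is one way to picture the negative-direction intervention the abstract mentions.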

Submission Length: Regular submission (no more than 12 pages of main content)

Changes Since Last Submission:
- added jailbreak results
- added table about robustness
- added more domains
- added results for different models
- added more recent citations
- spell checking

Code: https://github.com/PaulDrm/targeted_intervention

Assigned Action Editor: ~Alec_Koppel1

Submission Number: 4829
