DEAL: Disentangling Transformer Head Activations for LLM Steering

Published: 07 Jul 2025, Last Modified: 07 Jul 2025 · KnowFM @ ACL 2025 · CC BY 4.0
Keywords: Safety and alignment; LLM steering
Abstract: Inference-time steering seeks to modify the response characteristics of large language models (LLMs) while leaving their parameters intact. Existing techniques rely on superficial cues or ad-hoc heuristics to choose intervention modules, often yielding inaccurate or harmful outcomes. We introduce a principled causal-attribution framework that ranks behavior-relevant attention heads. Our method trains a vector-quantized autoencoder (VQ-AE) on each head’s hidden states, and a learnable codebook disentangles the latent space into behavior-related versus ancillary dimensions. To sharpen this separation, we augment the VQ-AE loss with a supervised contrastive component. An autoregressive scorer fitted to the resulting discrete codes assigns a behavior-relevance score to each head, guiding both head selection and importance weighting. Experiments across seven LLMs from two model families and five behavioral steering datasets confirm that the method accurately directs inference-time interventions.
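To make the method concrete, below is a minimal sketch of the per-head VQ-AE with a supervised contrastive term, assembled from the abstract's description alone. All names, dimensions, and loss weights (`HeadVQAE`, `latent_dim`, the 0.5 contrastive weight, etc.) are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch: a VQ-AE over one attention head's hidden states,
# trained with reconstruction + vector-quantization + supervised contrastive
# losses so that same-behavior activations cluster onto shared codes.
import torch
import torch.nn as nn
import torch.nn.functional as F

class HeadVQAE(nn.Module):
    """Vector-quantized autoencoder for a single attention head."""
    def __init__(self, head_dim=64, latent_dim=32, codebook_size=128, beta=0.25):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(head_dim, latent_dim), nn.ReLU(),
                                     nn.Linear(latent_dim, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, latent_dim), nn.ReLU(),
                                     nn.Linear(latent_dim, head_dim))
        self.codebook = nn.Embedding(codebook_size, latent_dim)  # learnable codes
        self.beta = beta  # commitment-loss weight

    def forward(self, h):
        z = self.encoder(h)                           # (B, latent_dim)
        dists = torch.cdist(z, self.codebook.weight)  # distance to each code, (B, K)
        codes = dists.argmin(dim=-1)                  # nearest-code index, (B,)
        z_q = self.codebook(codes)
        z_st = z + (z_q - z).detach()                 # straight-through gradient
        recon = self.decoder(z_st)
        vq_loss = (F.mse_loss(z_q, z.detach())        # move codes toward encodings
                   + self.beta * F.mse_loss(z, z_q.detach()))  # commitment term
        return recon, z, codes, vq_loss

def supervised_contrastive(z, labels, tau=0.1):
    """SupCon-style loss: pull same-behavior latents together, push others apart."""
    z = F.normalize(z, dim=-1)
    sim = z @ z.t() / tau                              # pairwise similarities, (B, B)
    mask = (labels.unsqueeze(0) == labels.unsqueeze(1)).float()
    mask.fill_diagonal_(0.0)                           # positives exclude self
    logits = sim - torch.eye(len(z), device=z.device) * 1e9  # mask self-pairs
    log_prob = logits - logits.logsumexp(dim=-1, keepdim=True)
    pos_counts = mask.sum(dim=-1).clamp(min=1.0)
    return -(log_prob * mask).sum(dim=-1).div(pos_counts).mean()

# One training step on a batch of head activations `h` (B, head_dim) with
# binary behavior labels `y` (B,), e.g. target behavior vs. neutral prompts.
model = HeadVQAE()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
h, y = torch.randn(16, 64), torch.randint(0, 2, (16,))
opt.zero_grad()
recon, z, codes, vq_loss = model(h)
loss = F.mse_loss(recon, h) + vq_loss + 0.5 * supervised_contrastive(z, y)
loss.backward()
opt.step()
```

The discrete `codes` produced per head would then be the input to the autoregressive scorer the abstract mentions, which fits a sequence model over code streams and uses its fit to assign each head a behavior-relevance score; that component is not sketched here.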
Archival Status: Non-archival (not included in proceedings)
Submission Number: 60