From Steering Vectors to Conceptors and Beyond: Compositional Affine Steering Mechanisms for LLMs

Steven Abreu; Joris Postmus

27 Sept 2024 (modified: 02 Dec 2024), ICLR 2025 Conference Withdrawn Submission, CC BY 4.0

Keywords: activation engineering, mechanistic interventions, model steering, large language models, activation addition, function vectors

TL;DR: We derive an optimal method for affine steering of representations in LLMs that improves over prevalent additive methods, and we provide empirical results that demonstrate improved control and flexibility across multiple tasks.

Abstract: Controlling and understanding the representations of large language models (LLMs) remain central challenges as they become more powerful. In this paper, we combine conceptor theory with recent advances in activation steering to develop a novel framework that generalizes both approaches for provably optimal affine steering. Conceptors characterize sets of neural network activations, representable as ellipsoids, and they act as soft projection matrices, enabling precise and flexible control over LLM activations while offering deeper insights into their internal representations. Our framework derives optimal affine steering functions from first principles, outperforming traditional additive steering methods across in-context learning tasks. Additionally, we use a Boolean algebra over conceptor matrices that allows for the composition of multiple steering objectives. Empirical results demonstrate that this approach surpasses existing methods for combining steering vectors. By uniting conceptor theory with activation steering, this work provides not only a more powerful tool for controlling LLM outputs, but also a principled approach for better understanding the internal mechanisms governing model representations and behavior.
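To make the abstract's description of conceptors as "soft projection matrices" with a Boolean algebra more concrete, here is a minimal NumPy sketch. It follows the standard definitions from conceptor theory (C = R (R + aperture^-2 I)^-1, with R the uncentered correlation matrix of the activations), not necessarily the exact formulation derived in this paper. The function names, the `aperture` and `beta` parameters, the random stand-in "activations", and the simple `steer` map are all illustrative assumptions. The AND formula used here is valid only when both conceptors are invertible.

```python
import numpy as np

def conceptor(X, aperture=10.0):
    """Conceptor of activations X with shape (n_samples, dim).

    Standard definition: C = R (R + aperture^-2 I)^-1, where R = X^T X / n.
    Eigenvalues of C lie in [0, 1), so C acts as a soft projection.
    """
    n, d = X.shape
    R = X.T @ X / n
    return R @ np.linalg.inv(R + aperture ** -2 * np.eye(d))

def NOT(C):
    """Boolean negation: I - C."""
    return np.eye(C.shape[0]) - C

def AND(C1, C2):
    """Boolean conjunction (assumes C1 and C2 are both invertible)."""
    I = np.eye(C1.shape[0])
    return np.linalg.inv(np.linalg.inv(C1) + np.linalg.inv(C2) - I)

def OR(C1, C2):
    """Boolean disjunction via de Morgan's law."""
    return NOT(AND(NOT(C1), NOT(C2)))

def steer(h, C, beta=1.0):
    """Hypothetical steering step: softly project h with C, then rescale.

    This is an illustrative assumption, not the optimal affine map
    derived in the paper.
    """
    return beta * (C @ h)

# Random stand-ins for cached activations from two steering objectives.
rng = np.random.default_rng(0)
X1 = rng.normal(size=(200, 8))
X2 = rng.normal(size=(200, 8)) @ np.diag(np.linspace(0.1, 2.0, 8))

C1, C2 = conceptor(X1), conceptor(X2)
C_both = AND(C1, C2)      # directions shared by both activation sets
C_either = OR(C1, C2)     # directions used by either activation set
h_steered = steer(rng.normal(size=8), C_both, beta=1.0)
```

In this sketch, composing objectives means combining conceptor matrices with `AND`/`OR` before steering, rather than adding or averaging steering vectors.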

Primary Area: interpretability and explainable AI

Submission Number: 11062
