Keywords: Activation steering, steering vectors, representation engineering, ReFT, controlled generation
TL;DR: We systematically study activation steering methods in a toy setup, and find a clear tradeoff between performance ceiling and data efficiency across methods of different expressiveness.
Abstract: Activation steering is a promising family of methods for controlling LLM outputs via targeted interventions on model activations. We introduce a toy multi-label classification setup to systematically study activation steering methods, and experiment with several types of steering adapters, from steering vectors (adding a fixed vector to activations) to more expressive adapters involving projections. We evaluate the adapters across steering tasks of varying complexity, for three notions of complexity: 1) how densely the features are packed in the representation space (roughly, the number of features divided by the dimensionality of the activations), 2) the number of attributes steered, and 3) the number of values the steered attribute can take. We find that as task complexity increases, steering vector methods perform worse, while the more expressive methods only take a performance hit when there is not enough data. Conversely, steering vectors usually outperform more expressive methods in the low-data regime, regardless of task complexity. We conclude by discussing this work's limitations, which include that our toy setup does not model features represented in superposition or continuous features, and that we do not experiment with LLMs.
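The two ends of the expressiveness spectrum contrasted in the abstract can be sketched as below. This is a minimal illustrative sketch, not the paper's implementation: the function names, dimensions, and parameter shapes are assumptions, and the low-rank adapter stands in generically for the projection-based adapters (e.g. ReFT-style interventions) the abstract mentions.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16  # hypothetical activation dimensionality (assumption)
r = 4   # hypothetical adapter rank (assumption)

def steering_vector(h, v):
    """Least expressive adapter: add one fixed learned vector to the activation."""
    return h + v

def low_rank_adapter(h, A, B):
    """More expressive adapter: add a learned low-rank linear transform of the
    activation itself, so the edit depends on the input representation."""
    return h + B @ (A @ h)

h = rng.standard_normal(d)       # a single activation at the intervened layer
v = rng.standard_normal(d)       # learned steering vector
A = rng.standard_normal((r, d))  # down-projection
B = rng.standard_normal((d, r))  # up-projection

steered_fixed = steering_vector(h, v)      # same offset for every input
steered_rank = low_rank_adapter(h, A, B)   # input-dependent offset
print(steered_fixed.shape, steered_rank.shape)
```

The key design difference is that the steering-vector edit is constant across inputs (few parameters, data-efficient), while the projection-based edit varies with the activation (more parameters, higher performance ceiling but more data-hungry), matching the tradeoff the abstract reports.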
Email Of Author Nominated As Reviewer: dmkr0001@gmail.com
Submission Number: 35