Keywords: Steering vectors, steering vector interpretability, sparse autoencoders, representation engineering
TL;DR: SAEs provide misleading interpretations of steering vectors. We pinpoint the exact causes of this to aid the development of better methods.
Abstract: Steering vectors are a promising method to control the behaviour of large language models. However, their underlying mechanisms remain poorly understood. While representing steering vectors as combinations of sparse autoencoder (SAE) features appears to be a promising direction for interpreting steering vectors, recent findings show that SAE-reconstructed vectors often lack the steering properties of the original vectors. This paper investigates why directly applying SAEs to steering vectors yields misleading decompositions, identifying two reasons: (1) steering vectors fall outside the input distribution for which SAEs are designed, and (2) steering vectors can have meaningful negative projections in feature directions, which SAEs are not designed to accommodate. These limitations hinder the direct use of SAEs for interpreting steering vectors.
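As an illustration of reason (2), here is a minimal sketch of how a ReLU encoder discards negative projections. It assumes a toy SAE with an orthonormal dictionary, tied encoder and decoder weights, and zero biases; these choices, and all names such as `sae_encode` and `steering_vector`, are hypothetical and not taken from the paper.

```python
import numpy as np

def sae_encode(x, W_enc, b_enc):
    # Standard ReLU SAE encoder: feature activations are clamped to be non-negative.
    return np.maximum(x @ W_enc + b_enc, 0.0)

def sae_decode(f, W_dec, b_dec):
    # Linear decoder: reconstruction is a non-negative combination of feature directions.
    return f @ W_dec + b_dec

d_model = 4
W_dec = np.eye(d_model)      # assumption: rows are orthonormal feature directions
W_enc = W_dec.T              # assumption: tied weights
b_enc = np.zeros(d_model)    # assumption: zero biases
b_dec = np.zeros(d_model)

# A toy steering vector that adds feature 0 and *subtracts* feature 1.
steering_vector = 2.0 * W_dec[0] - 1.5 * W_dec[1]

features = sae_encode(steering_vector, W_enc, b_enc)
reconstruction = sae_decode(features, W_dec, b_dec)

# Projections onto each feature direction, before and after the SAE.
projections = W_dec @ steering_vector          # [ 2. , -1.5,  0. ,  0. ]
recon_projections = W_dec @ reconstruction     # [ 2. ,  0. ,  0. ,  0. ]

print("original projections:     ", projections)
print("reconstructed projections:", recon_projections)
```

In this toy setting the decomposition reports only feature 0, so the component that suppresses feature 1 is missing from both the interpretation and the reconstructed vector.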
Email Of Author Nominated As Reviewer: harry.mayne@oii.ox.ac.uk
Submission Number: 21