Keywords: Steering vectors, sparse autoencoders, mechanistic interpretability
TL;DR: Using sparse autoencoders to interpret steering vectors can be misleading; we explain two reasons why.
Abstract: Steering vectors are a promising method for controlling the behaviour of large language models. However, their underlying mechanisms remain poorly understood. While decomposing steering vectors into combinations of sparse autoencoder (SAE) features appears to be a promising direction for interpreting them, recent findings show that SAE-reconstructed vectors often lack the steering properties of the originals. This paper investigates why directly applying SAEs to steering vectors yields misleading decompositions, identifying two reasons: (1) steering vectors fall outside the input distribution for which SAEs are designed, and (2) steering vectors can have meaningful negative projections onto feature directions, which SAEs are not designed to accommodate. These limitations hinder the direct use of SAEs for interpreting steering vectors.
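For intuition on reason (2): standard SAE encoders apply a ReLU, so feature activations are non-negative and any negative projection of the input onto a feature direction is zeroed out before decoding. Below is a minimal toy sketch (not the paper's code); the tied-weight ReLU architecture, the orthonormal dictionary, and all names and dimensions are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
d_model = 16

# Hypothetical orthonormal dictionary of unit-norm feature directions
# (decoder rows); orthonormality is assumed only to make the effect exact.
W_dec, _ = np.linalg.qr(rng.normal(size=(d_model, d_model)))
W_dec = W_dec.T

def sae_reconstruct(x):
    # Tied-weight encoder with ReLU: activations are non-negative, so
    # negative projections onto features are zeroed out (biases omitted).
    acts = np.maximum(W_dec @ x, 0.0)
    return acts @ W_dec  # decoder: non-negative combination of features

# A steering vector pointing *against* feature 0 and *along* feature 1.
v = -1.0 * W_dec[0] + 1.0 * W_dec[1]
recon = sae_reconstruct(v)

print(np.allclose(recon, W_dec[1]))  # True: only the positive part survives
print(np.linalg.norm(v - recon))     # ~1.0: the -W_dec[0] component is lost

The ReLU discards the negative coefficient on feature 0, so the decomposition reports only the positive feature-1 component and silently drops half the vector, which is one way a direct SAE decomposition of a steering vector can mislead.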
Track: Main track
Submitted Paper: Yes
Published Paper: No
Submission Number: 77