Track: Long Paper Track (up to 9 pages)
Keywords: interpretability and explainable AI, activation steering, sparse autoencoders, mechanistic interpretability applications
TL;DR: We introduce a new method for steering large language models that combines sparse autoencoder features with activation engineering to achieve more precise and interpretable control over model behavior.
Abstract: Effective and reliable control over Large Language Model behavior is a significant challenge. While activation steering methods, which add steering vectors to a model's hidden states, are a promising approach, existing techniques often lack precision and interpretability in how they influence model outputs. We introduce Feature Guided Activation Additions (FGAA), a novel activation steering method that leverages insights from Contrastive Activation Addition (CAA) and Sparse Autoencoder-Targeted Steering (SAE-TS). By operating in the latent space of a Sparse Autoencoder (SAE) and employing optimization techniques to select desired SAE features, FGAA constructs precise, human-interpretable steering vectors that provide stronger steering effects while maintaining the coherence of steered model outputs. Evaluations on Gemma-2-2B and Gemma-2-9B models across a variety of steering tasks demonstrate that FGAA outperforms the existing steering methods CAA, SAE decoder steering, and SAE-TS. Our results also highlight important trade-offs between steering scale and general model capabilities that are consistent across all tested steering methods.
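For readers unfamiliar with activation steering, the sketch below illustrates the generic mechanism the abstract refers to: adding a scaled steering vector to a model's hidden states at one layer via a forward hook. This is a minimal illustration of activation addition in general, not the FGAA construction from the paper; the layer index, scale, and random placeholder vector are assumptions made for the example (a real vector would come from a method such as CAA, SAE-TS, or FGAA).

```python
# Minimal sketch of generic activation addition with a PyTorch forward hook.
# Layer index, scale, and the placeholder steering vector are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "google/gemma-2-2b"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

layer_idx = 12   # hypothetical steering layer
scale = 4.0      # hypothetical steering scale
steering_vector = torch.randn(model.config.hidden_size)  # placeholder vector

def add_steering(module, inputs, output):
    # Decoder layers return a tuple; the first element holds the hidden states.
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + scale * steering_vector.to(hidden.device, hidden.dtype)
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

handle = model.model.layers[layer_idx].register_forward_hook(add_steering)
try:
    ids = tokenizer("I think that", return_tensors="pt")
    out = model.generate(**ids, max_new_tokens=30)
    print(tokenizer.decode(out[0], skip_special_tokens=True))
finally:
    handle.remove()  # detach the hook so later generations run unsteered
```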
Submission Number: 11