Keywords: Mechanistic interpretability, Activation engineering, Large language models, Sparse autoencoders
TL;DR: The first unified mechanistic interpretation framework bridging the gap between pure neuron interpretation and effective behavior control.
Abstract: Existing work on neuron interpretation and behavior control in large language models has largely been developed independently. On one hand, pioneering work in neuron interpretation relies on training sparse autoencoders (SAEs) to extract interpretable concepts; however, interventions on these concepts have been shown to be less effective for controlling model behavior. On the other hand, dedicated behavior control approaches add a steering vector to neuron activations during inference while ignoring interpretation. In this work, we present a unified framework that connects the two, which is crucial for truly understanding model behavior through interpretable internal representations. Compared to existing SAE-based interpretation frameworks, the unified framework not only enables effective behavior control but also uniquely allows flexible, user-friendly concept specification while maintaining model performance. Compared to dedicated behavior control approaches, we guarantee the steering effect while additionally explaining how much each concept contributes to the steering process and what role each plays in explaining the neurons being steered. Our work sheds light on designing better interpretation frameworks that explicitly account for control during interpretation.
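To make the two ingredients mentioned in the abstract concrete, below is a minimal, hypothetical sketch (not the submission's actual method) of how an SAE decomposes a hidden activation into concept coefficients and how a steering vector built from a concept direction could be added back to that activation. All names, shapes, and the random weights (`W_enc`, `W_dec`, `steer`, etc.) are illustrative assumptions.

```python
import torch

# Illustrative sketch only: SAE-style concept decomposition plus a
# steering-vector intervention in the concept basis. Shapes and weights
# are placeholders, not trained parameters from the paper.
d_model, n_concepts = 16, 64
torch.manual_seed(0)

# Hypothetical SAE weights: encoder maps activations to concept
# coefficients; decoder columns act as interpretable concept directions.
W_enc = torch.randn(n_concepts, d_model) * 0.1
b_enc = torch.zeros(n_concepts)
W_dec = torch.randn(d_model, n_concepts) * 0.1

def sae_encode(h: torch.Tensor) -> torch.Tensor:
    # Sparse, non-negative concept coefficients for activation h.
    return torch.relu(W_enc @ h + b_enc)

def sae_decode(c: torch.Tensor) -> torch.Tensor:
    # Reconstruct the activation from its concept coefficients.
    return W_dec @ c

def steer(h: torch.Tensor, concept_idx: int, strength: float) -> torch.Tensor:
    # Add a scaled concept direction (a decoder column) to the activation,
    # i.e. a steering vector expressed in the SAE's concept basis.
    return h + strength * W_dec[:, concept_idx]

h = torch.randn(d_model)                      # a hidden activation from some layer
c = sae_encode(h)                             # interpretable decomposition
h_steered = steer(h, concept_idx=3, strength=2.0)

print("top concepts:", torch.topk(c, 3).indices.tolist())
print("steering delta norm:", (h_steered - h).norm().item())
```

In this sketch, interpretation (reading off `c`) and control (adding `strength * W_dec[:, concept_idx]`) share the same concept basis, which is the kind of connection the abstract argues existing pipelines treat separately.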
Primary Area: interpretability and explainable AI
Submission Number: 15247