Analyze Feature Flow to Enhance Interpretation and Steering in Language Models

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
Abstract: We introduce a new approach to systematically map features discovered by sparse autoencoders across consecutive layers of large language models, extending earlier work that examined inter-layer feature links. Using a data-free cosine similarity technique, we trace how specific features persist, transform, or first appear at each stage. This method yields granular flow graphs of feature evolution, enabling fine-grained interpretability and mechanistic insights into model computations. Crucially, we demonstrate how these cross-layer feature maps facilitate direct steering of model behavior by amplifying or suppressing chosen features, achieving targeted thematic control in text generation. Together, our findings highlight the utility of a causal, cross-layer interpretability framework that not only clarifies how features develop through forward passes but also provides new means for transparent manipulation of large language models.
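To make the two ideas in the abstract concrete, here is a minimal Python/PyTorch sketch, not the authors' released code: matching SAE features between adjacent layers purely from the cosine similarity of their decoder directions (the "data-free" step), and nudging the residual stream along a chosen feature direction to amplify or suppress it. The tensor shapes, the similarity threshold, and the steering coefficient are illustrative assumptions.

```python
import torch
import torch.nn.functional as F


def match_features(decoder_a: torch.Tensor,
                   decoder_b: torch.Tensor,
                   threshold: float = 0.7):
    """Link SAE features at layer L to SAE features at layer L+1.

    decoder_a: (n_features_a, d_model) decoder directions at layer L
    decoder_b: (n_features_b, d_model) decoder directions at layer L+1
    Returns, for each layer-L feature, the index of its best cosine
    match at layer L+1, or -1 when similarity falls below `threshold`
    (the feature appears to vanish, or a new one emerges downstream).
    """
    a = F.normalize(decoder_a, dim=-1)
    b = F.normalize(decoder_b, dim=-1)
    sims = a @ b.T                          # (n_features_a, n_features_b)
    best_sim, best_idx = sims.max(dim=-1)
    best_idx[best_sim < threshold] = -1     # unmatched -> no continuation
    return best_idx, best_sim


def steer(residual: torch.Tensor,
          feature_direction: torch.Tensor,
          alpha: float) -> torch.Tensor:
    """Amplify (alpha > 0) or suppress (alpha < 0) a feature by adding
    its unit decoder direction to the residual-stream activations."""
    return residual + alpha * F.normalize(feature_direction, dim=-1)
```

Chaining `match_features` over consecutive layer pairs yields the kind of cross-layer flow graph the abstract describes, and `steer` illustrates the intervention used for thematic control; the exact hook points and coefficients in the paper may differ.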
Lay Summary: This research investigates how large language models internally store, organize, and transform concepts across their different layers. By analyzing how specific features emerge, evolve, or disappear as they move through the model, the study creates detailed maps of these relationships, revealing patterns that were previously unclear. These insights not only improve our understanding of how these complex systems process information but also enable more precise control over their behavior. By modifying key concepts, users can now guide the model’s internal mechanisms more effectively, leading to greater transparency and better-controlled outcomes.
Primary Area: Deep Learning->Large Language Models
Keywords: Mechanistic Interpretability, Sparse Autoencoders, Steering, Large Language Models
Submission Number: 4619