Beyond Input Activations: Identifying Influential Latents by Gradient Sparse Autoencoders

ACL ARR 2025 May Submission 698 Authors

15 May 2025 (modified: 03 Jul 2025) · ACL ARR 2025 May Submission · CC BY 4.0

Abstract: Sparse Autoencoders (SAEs) have recently emerged as powerful tools for interpreting and steering the internal representations of large language models (LLMs). However, conventional approaches to analyzing SAEs typically rely solely on input-side activations and do not consider the influence of each latent feature on the model's output. This work is built on two key hypotheses: (1) activated latents do not contribute equally to the construction of the model's output, and (2) only latents with high influence are effective for model steering. To validate these hypotheses, we propose Gradient Sparse Autoencoder (GradSAE), a simple yet effective method that identifies the most influential latents by incorporating output-side gradient information.
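To make the contrast between input-side activations and output-side influence concrete, here is a minimal sketch. The abstract does not state GradSAE's scoring rule, so the rule used here (activation multiplied by the gradient of an output logit, ranked by absolute value) is an assumption, as are the linear toy model and the names `influence_scores`, `top_influential`, `decoder`, and `readout`.

```python
def influence_scores(latents, decoder, readout):
    """Score each latent as activation * d(logit)/d(latent) in a linear toy model.

    Assumption: logit = readout . (sum_i latents[i] * decoder[i]).
    This is an illustrative rule, not the paper's stated method.
    """
    scores = []
    for z_i, w_i in zip(latents, decoder):
        # In the linear toy model the gradient w.r.t. z_i is decoder[i] . readout.
        grad_i = sum(w * r for w, r in zip(w_i, readout))
        scores.append(z_i * grad_i)
    return scores


def top_influential(latents, decoder, readout, k):
    """Return indices of the k active latents with the largest absolute score."""
    scores = influence_scores(latents, decoder, readout)
    active = [i for i, z in enumerate(latents) if z > 0]
    return sorted(active, key=lambda i: abs(scores[i]), reverse=True)[:k]


# Hypothetical toy values: 4 latents, 3-dimensional hidden state.
latents = [2.0, 0.0, 0.5, 1.0]
decoder = [
    [0.1, 0.0, 0.0],
    [1.0, 1.0, 1.0],
    [0.0, 2.0, 0.0],
    [0.0, 0.0, -1.5],
]
readout = [1.0, 1.0, 1.0]

# Ranking by activation alone versus ranking by the gradient-weighted score.
by_activation = sorted(range(len(latents)), key=lambda i: latents[i], reverse=True)[:2]
by_influence = top_influential(latents, decoder, readout, k=2)
```

In this toy setup the most strongly activated latent (index 0) has little effect on the logit, so the two rankings disagree, which is the situation hypothesis (1) describes.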

Paper Type: Short

Research Area: Interpretability and Analysis of Models for NLP

Research Area Keywords: hierarchical & concept explanations, probing

Contribution Types: Model analysis & interpretability, NLP engineering experiment

Languages Studied: English

Submission Number: 698
