Keywords: Sparse autoencoders, interpretability, language models
TL;DR: With current techniques it is not possible to rewrite LLMs with natural language and sparse autoencoders
Abstract: The greatest ambition of mechanistic interpretability is to completely rewrite deep neural networks in a format that is more amenable to human understanding, while preserving their behavior and performance. In this paper we evaluate whether sparse autoencoders (SAEs) and transcoders can be used for this purpose. We use an automated pipeline to generate explanations for each of the sparse coder latents. We then simulate the activation of each latent on a variety of inputs using an LLM prompted with the explanation generated in the previous step, and "partially rewrite" the original model by patching the simulated activations into its forward pass. We find that current sparse coding techniques and automated interpretability pipelines are not up to the task of rewriting even a single layer of a transformer: the model is severely degraded by patching in the simulated activations. We believe this approach is the most thorough way to assess the quality of SAEs and transcoders, despite its high computational cost.
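The patching step described in the abstract can be illustrated with a minimal sketch: replace one layer's output with the sparse coder's decoding of LLM-simulated latent activations, then measure how much the model degrades. This is not the paper's actual code; `model`, `sae`, `simulate_latents`, and `evaluate_cross_entropy` are hypothetical placeholders, assuming a PyTorch transformer and an SAE exposing a `decode` method.

```python
# Sketch of "partial rewriting": patch simulated SAE latent activations
# into a model's forward pass at a single layer via a PyTorch hook.
# All identifiers below are illustrative assumptions, not the authors' code.
import torch


def make_patch_hook(sae, simulate_latents):
    """Return a forward hook that replaces a layer's output with the SAE
    decoding of latent activations simulated by an explanation-prompted LLM."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        # Simulated activations come from an LLM prompted with each latent's
        # natural-language explanation (shape: batch x seq x n_latents).
        sim_latents = simulate_latents(hidden)
        patched = sae.decode(sim_latents)  # map latents back to the residual stream
        if isinstance(output, tuple):
            return (patched,) + output[1:]
        return patched
    return hook


# Hypothetical usage: patch one transformer layer and compare loss
# against the unpatched model to quantify the degradation.
# handle = model.layers[6].register_forward_hook(make_patch_hook(sae, simulate_latents))
# loss_patched = evaluate_cross_entropy(model, eval_batches)
# handle.remove()
```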
Supplementary Material: zip
Primary Area: Social and economic aspects of machine learning (e.g., fairness, interpretability, human-AI interaction, privacy, safety, strategic behavior)
Submission Number: 28326