AxBench: Steering LLMs? Even Simple Baselines Outperform Sparse Autoencoders

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 Spotlight Poster · CC BY 4.0
TL;DR: In this work, we introduce AxBench, a benchmark for evaluating LM control methods at scale using synthetic data.
Abstract: Fine-grained steering of language model outputs is essential for safety and reliability. Prompting and finetuning are widely used to achieve these goals, but interpretability researchers have proposed a variety of representation-based techniques as well, including sparse autoencoders (SAEs), linear artificial tomography, supervised steering vectors, linear probes, and representation finetuning. At present, there is no benchmark for making direct comparisons between these proposals. Therefore, we introduce AxBench, a large-scale benchmark for steering and concept detection, and report experiments on Gemma-2-2B and 9B. For steering, we find that prompting outperforms all existing methods, followed by finetuning. For concept detection, representation-based methods, such as difference-in-means, perform best. On both evaluations, SAEs are not competitive. We introduce a novel weakly-supervised representational method (Rank-1 Representation Finetuning; ReFT-r1), which is competitive on both tasks while providing the interpretability advantages that prompting lacks. Along with AxBench, we train and publicly release SAE-scale feature dictionaries for ReFT-r1 and DiffMean.
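
The abstract contrasts SAEs with simpler representation-based baselines such as difference-in-means (DiffMean). As a rough illustration of what such a baseline does, here is a minimal sketch on synthetic activations; the toy dimensions, variable names, and scaling factor are assumptions for illustration and are not the AxBench implementation.

```python
# Minimal, illustrative sketch of the difference-in-means (DiffMean) idea
# on synthetic hidden states; not the AxBench code.
import numpy as np

rng = np.random.default_rng(0)
d = 16  # toy hidden size (assumption)

# Synthetic stand-ins for LM activations: "positive" examples contain the
# concept, "negative" examples do not.
concept_direction = rng.normal(size=d)
pos = rng.normal(size=(100, d)) + concept_direction  # concept present
neg = rng.normal(size=(100, d))                      # concept absent

# DiffMean vector: difference of the two class means in activation space.
diffmean = pos.mean(axis=0) - neg.mean(axis=0)
diffmean /= np.linalg.norm(diffmean)

# Concept detection: score a new activation by its projection onto the vector.
new_activation = rng.normal(size=d) + concept_direction
score = new_activation @ diffmean
print(f"concept score: {score:.2f}")  # higher -> concept more likely present

# Steering (conceptually): add a scaled copy of the vector to the hidden state
# during generation, i.e. h_steered = h + alpha * diffmean.
alpha = 4.0  # illustrative steering strength
h_steered = new_activation + alpha * diffmean
```

The same learned vector thus serves both tasks evaluated in AxBench: projecting onto it gives a concept-detection score, and adding a scaled copy of it to the residual stream steers generation.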
Lay Summary: Imagine trying to teach a voice assistant to avoid spoilers, speak politely, or explain ideas at a child-friendly level every single time it answers. Researchers use two main tricks to guide these systems today: (1) writing clever prompts and (2) re-training the model on lots of new examples. A third family of methods—tweaking what happens *inside* the model’s hidden layers—has attracted growing interest because it promises faster, more targeted control. Yet no one has had a single testbed for judging which approach actually works best.

Our work introduces **AxBench**, the first large-scale benchmark designed to compare all three strategies on two everyday challenges:

1. **Steering** – getting the model to talk *about* or *avoid* a chosen topic.
2. **Concept detection** – quickly spotting whether a passage already contains that topic.

Running AxBench on open-source Gemma models (2-billion and 9-billion parameters), we found:

* Well-crafted prompts still give the most reliable steering, with full model retraining close behind.
* For detecting concepts, simple statistical checks inside the model outperform everything else.
* A popular interpretability tool, sparse autoencoders, surprisingly lags on both tasks.

Finally, we present **ReFT-r1**, a lightweight way to nudge the model’s internal representations. It competes with the best methods on both steering and detection while remaining transparent about *why* it works. To help others build on this, we are releasing AxBench, our evaluation code, and ready-to-use feature dictionaries for the community.
Link To Code: https://github.com/stanfordnlp/axbench
Primary Area: Social Aspects->Accountability, Transparency, and Interpretability
Keywords: LLMs, Interpretability, Sparse Autoencoders, Dictionary Learning
Submission Number: 12232