Retrieval is Enough: Training-Free Interpretability with a Tool-Using Agent
Keywords: agents, interpretability, SAE, activation oracle, concept discovery
TL;DR: A vector database can be enough to replace your SAE/activation oracle
Abstract: Interpretability methods for neural network activations span a wide cost spectrum, from cheap, training-free techniques (such as linear probes, PCA, and SVD) to more expensive training-based ones (such as SAEs and activation oracles). Training-based methods are typically more powerful, in part because they leverage large activation datasets during training. This raises a natural question: do they actually surface insights that go beyond what is recoverable from the training dataset itself?
To address this, we equip an LLM agent with a vector database of activations paired with their textual contexts, along with tools for manipulating activations: projecting out directions in latent space and computing activation differences and averages. The agent iteratively queries the database, forms hypotheses from the retrieved samples, and validates them by constructing linear probes. We call this method **HARP**, for **H**ypothesis-driven **A**gentic **R**etrieval and **P**robing. Despite involving no training, HARP outperforms both activation oracles and SAE-based agents on concept discovery, concept detection, model steering, and secret elicitation. The training-free design also makes HARP substantially cheaper and more flexible: new datasets can be indexed on demand whenever existing ones prove insufficient. More broadly, our results suggest that current training-based methods do not yet extract insights beyond their training data, and motivate benchmarks that explicitly require interpretability methods to demonstrate such insights.
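As a rough illustration of the activation tools named in the abstract (nearest-neighbor retrieval, projecting out a direction, and a probe built from a difference of averages), the following is a minimal NumPy sketch. It is not the authors' implementation: the function names `retrieve`, `project_out`, and `mean_difference_direction`, the in-memory array standing in for a vector database, the cosine-similarity metric, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np


def retrieve(index, query, k=3):
    """Return indices of the k rows of `index` most cosine-similar to `query`.

    `index` is a plain array standing in for a vector database (assumption).
    """
    index_n = index / np.linalg.norm(index, axis=1, keepdims=True)
    query_n = query / np.linalg.norm(query)
    scores = index_n @ query_n
    return np.argsort(-scores)[:k]


def project_out(acts, direction):
    """Remove the component of each activation along `direction`."""
    d = direction / np.linalg.norm(direction)
    return acts - np.outer(acts @ d, d)


def mean_difference_direction(pos, neg):
    """Difference of class means, used here as a simple probe direction."""
    return pos.mean(axis=0) - neg.mean(axis=0)


# Synthetic "activations": positives are shifted along one hidden axis.
rng = np.random.default_rng(0)
dim = 16
concept = np.zeros(dim)
concept[0] = 1.0
pos = rng.normal(size=(50, dim)) + 6.0 * concept
neg = rng.normal(size=(50, dim))
acts = np.vstack([pos, neg])
labels = np.array([True] * 50 + [False] * 50)
# Hypothetical textual contexts paired with each activation.
contexts = [f"context {i}" for i in range(len(acts))]

# Retrieve the neighbors of one activation, as an agent query might.
neighbors = retrieve(acts, acts[0], k=3)
retrieved_contexts = [contexts[i] for i in neighbors]

# Build a probe from the class means and threshold at the midpoint.
direction = mean_difference_direction(pos, neg)
threshold = 0.5 * ((pos @ direction).mean() + (neg @ direction).mean())
accuracy = ((acts @ direction > threshold) == labels).mean()

# After projecting the direction out, no activation has a component along it.
cleaned = project_out(acts, direction)
residual = np.abs(cleaned @ direction).max()
```

In this toy setup the probe is evaluated on the same samples used to build it, so `accuracy` only checks that the direction separates the two groups; a held-out split would be needed for any real validation.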
Submission Number: 126