Circuit Oracle: Automating Attribution Graph Analysis via Natural-Language Queries
Keywords: Interpretability, Multi-agent systems, AI Safety
TL;DR: Given a natural-language query, Circuit Oracle traverses a model's attribution graph through an agentic framework and returns interpretability-based insights relevant to the query.
Abstract: Attribution graphs, an emerging tool in mechanistic interpretability, use transcoders to decompose language model computations into sparse, interpretable features connected by causal edges. However, turning a graph into a safety-relevant insight requires hours of manual analysis by experts. We introduce \textbf{Circuit Oracle}, a multi-agent system that automates this analysis by answering natural-language questions about a target model (e.g., ``Is this prediction driven by spurious features?'') through multi-hop traversal of the attribution graph. We evaluate Circuit Oracle on three safety-relevant proxy tasks: detecting spurious features in probe circuits, eliciting hidden knowledge from taboo-finetuned models, and jailbreaking via causal interventions. On all three tasks, the oracle performs comparably to or better than task-specific baselines that do not use the attribution graph. Circuit Oracle requires no fine-tuning: each task is specified by a modular \textit{skill}, a natural-language prompt paired with task-specific tools such as transcoder-feature steering, which makes the framework extensible by construction. Our results suggest that off-the-shelf agents reading attribution graphs through tool calls offer a practical, general-purpose route to automated mechanistic interpretability.
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 172