Keywords: Circuit Analysis, Attribution Graphs, Automated Interpretability
TL;DR: LLMs Can Annotate Attribution Graphs
Abstract: Circuit tracing is an exciting technique for revealing internal computation in language models, but it requires a time-intensive manual step: grouping individual features or MLP neurons into supernodes. We present a simple pipeline that automates this step by presenting feature descriptions directly to a language model, which groups them into supernodes. Using automated interpretability metrics, we confirm that supernodes generated by our pipeline are as interpretable as those generated by human annotators. On a two-hop Capitals task, our pipeline recovers a supernode corresponding to the intermediate hop in 97 of 100 prompts. Finally, we present a proof of concept that uses our pipeline for open-ended exploration: we first automatically annotate 200 attribution graphs from Wikipedia prompts and then use an LLM judge to flag graphs worth human review. We hope this work demonstrates that even simple automation can produce meaningful attribution graph annotations, motivating further work on automated circuit tracing.
Submission Number: 617