Abstract: Circuit discovery with edge-level ablation has become a foundational framework for mechanism interpretability of language models. However, its focus on individual edges often overlooks the sequential, path-level causal relationships that underpin complex behaviors, thus potentially leading to misleading or incomplete circuit discoveries. To address this issue, we propose a novel path-level circuit discovery framework capturing how behaviors emerge through interconnected linear chain and build towards complex behaviors.
Our framework is constructed upon a fully-disentangled linear combinations of ``memory circuits'' decomposed from the original model. To discover functional circuit paths, we leverage a 2-step pruning strategy by first reducing the computational graph to a faithful and minimal subgraph and then applying causal mediation to identify common paths of a specific skill, termed as skill paths. In contrast to circuit graph from existing works, we focus on the complete paths of a generic skill rather than on the fine-grained responses to individual components of the input. To demonstrate this, we explore three generic language skills, namely Previous Token Skill, Induction Skill and In-Context Learning Skill using our framework and provide more compelling evidence to substantiate stratification and inclusiveness of these skills.
Paper Type: Long
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: LLM interpretability, causal mediation analysis, circuit discovery, language skill
Contribution Types: Model analysis & interpretability, Theory
Languages Studied: English
Submission Number: 724
Loading