Towards Global-level Mechanistic Interpretability: A Perspective of Modular Circuits of Large Language Models

Published: 01 May 2025, Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: We propose a modular circuits discovery method called ModCirc.
Abstract: Mechanistic interpretability (MI) research aims to understand large language models (LLMs) by identifying computational circuits: subgraphs of model components, with associated functional interpretations, that explain specific behaviors. Current MI approaches focus on discovering task-specific circuits, which has two key limitations: (1) poor generalizability across different language tasks, and (2) high costs from requiring human or advanced-LLM interpretation of each computational node. To address these challenges, we propose developing a "modular circuit (MC) vocabulary" consisting of task-agnostic functional units, each a small computational subgraph paired with its interpretation. This approach enables global interpretability by allowing different language tasks to share common MCs, while reducing costs by reusing established interpretations for new tasks. We establish five criteria for characterizing the MC vocabulary and present ModCirc, a novel global-level mechanistic interpretability framework for discovering MC vocabularies in LLMs. We demonstrate ModCirc's effectiveness by showing that it identifies modular circuits that perform well on various metrics.
Lay Summary: How do powerful AI language models like ChatGPT actually work inside? Current methods for understanding these systems analyze each task separately and require expensive human interpretation of every component, making comprehensive analysis impractical as AI systems grow larger. We developed a new approach called ModCirc that identifies reusable "building blocks" within AI models that perform similar functions across different tasks. Think of these like specialized tools in a workshop that can be used for multiple projects rather than being built from scratch each time. For example, we found components in a medical AI that consistently identify patient symptoms whether the task involves diagnosis, treatment recommendations, or medical summarization. Our method creates a vocabulary of these reusable components with pre-established interpretations, dramatically reducing the cost of understanding new AI behaviors. When analyzing a new task, researchers can match components against this existing vocabulary instead of starting interpretation from zero. Testing on both general and domain-specific tasks, we identified important components and demonstrated clear patterns of reuse across different applications. This approach makes interpreting AI systems more affordable and scalable.
Link To Code: https://github.com/YinhanHe123/ModCirc
Primary Area: Social Aspects->Accountability, Transparency, and Interpretability
Keywords: Mechanistic Interpretability, Modular Circuits
Submission Number: 8221