Mechanistic Insights: Circuit Transformations Across Input and Fine-Tuning Landscapes

25 Sept 2024 (modified: 26 Nov 2024) · ICLR 2025 Conference Withdrawn Submission · CC BY 4.0
Keywords: Mechanistic Interpretability, PEFT, circuit
Abstract: Mechanistic interpretability seeks to uncover the internal mechanisms of Large Language Models (LLMs) by identifying circuits—subgraphs of the model’s computational graph that correspond to specific behaviors—while ensuring sparsity and maintaining task performance. Although automated methods have made large-scale circuit discovery feasible, determining the functionalities of circuit components still requires manual effort, limiting scalability and efficiency. To address this, we propose a novel framework that accelerates circuit discovery and analysis. Building on methods such as edge pruning, our framework introduces circuit selection, comparison, attention grouping, and logit clustering to investigate the intended functionalities of circuit components. By focusing on what components aim to achieve, rather than on their direct causal effects, this framework streamlines interpretability analysis, reduces manual labor, and scales the study of model behaviors across diverse tasks. Motivated by the observation that circuits vary when models are fine-tuned or prompts are tweaked (while the task type stays the same), we apply our framework to study these variations across four PEFT methods and full fine-tuning on two well-known tasks. Our results suggest that while fine-tuning generally preserves the structure of the mechanism used to solve a task, individual circuit components may not retain their original intended functionalities.
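
The abstract does not specify how circuit comparison is implemented. As a minimal sketch of the kind of comparison it describes—assuming circuits are represented as sets of named edges in the computational graph (e.g., the edges retained by edge pruning), which is an illustrative assumption rather than the authors' actual API—one could measure structural preservation across fine-tuning with a Jaccard overlap:

```python
# Sketch: comparing a base model's circuit to a fine-tuned model's circuit.
# Circuits are assumed to be sets of edge names in the computational graph;
# all edge names and variable names below are hypothetical.

def circuit_overlap(base_edges: set[str], finetuned_edges: set[str]) -> float:
    """Jaccard similarity between two circuits' edge sets.

    A value near 1.0 suggests fine-tuning preserved the circuit's structure;
    a low value suggests the subgraph solving the task was rewired.
    """
    if not base_edges and not finetuned_edges:
        return 1.0
    return len(base_edges & finetuned_edges) / len(base_edges | finetuned_edges)


# Hypothetical edge sets before and after PEFT on the same task.
base = {"a0.h3->mlp2", "mlp2->a5.h1", "a5.h1->logits"}
peft = {"a0.h3->mlp2", "mlp2->a5.h1", "a4.h7->logits"}
print(f"edge overlap: {circuit_overlap(base, peft):.2f}")  # edge overlap: 0.50
```

Note that a high edge overlap only indicates structural preservation; per the paper's claim, individual components may still change their intended functionalities, which the framework probes separately via attention grouping and logit clustering.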
Primary Area: interpretability and explainable AI
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 5001