Abstract: Fine-tuning significantly improves the performance of Large Language Models (LLMs), yet its underlying mechanisms remain poorly understood. This paper aims to provide an in-depth interpretation of the fine-tuning process through circuit analysis, a popular tool in *Mechanistic Interpretability (MI)*. Unlike previous studies (Prakash et al. 2024, Chhabra et al. 2024) that focus on tasks where pre-trained models already perform well, we develop a set of mathematical tasks where fine-tuning yields substantial performance gains, bringing the setup closer to real-world scenarios. In our experiments, we identify circuits at various checkpoints during fine-tuning and examine the interplay between circuit analysis, fine-tuning methods, and task complexities. First, we find that while circuits maintain high node similarity before and after fine-tuning, their edges undergo significant changes, contrasting with previous work (Prakash et al. 2024, Chhabra et al. 2024) that reported only small circuit additions after fine-tuning. Based on these observations, we develop a **circuit-aware Low-Rank Adaptation (LoRA)** method that assigns ranks to layers according to edge changes in the circuits. Experimental results demonstrate that our circuit-aware LoRA achieves an average improvement of 2.46% over standard LoRA with a comparable parameter budget. Furthermore, we explore how combining circuits from subtasks can enhance fine-tuning in compositional tasks, offering new insights into task design and deepening our understanding of circuit dynamics and fine-tuning mechanisms.
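The abstract states only that ranks are assigned to layers according to circuit edge changes, without giving the allocation rule. As a rough illustration, the minimal sketch below distributes a fixed total LoRA rank budget across layers in proportion to per-layer edge changes; the function name, the proportional rule, and the example numbers are all assumptions for exposition, not the authors' implementation.

```python
# Hypothetical sketch of circuit-aware rank allocation: split a fixed
# LoRA rank budget across layers in proportion to how many circuit
# edges changed for each layer during fine-tuning. All names and the
# proportional rule are illustrative assumptions.

def allocate_lora_ranks(edge_changes, total_budget, min_rank=1):
    """Distribute a total LoRA rank budget across layers.

    edge_changes: dict mapping layer name -> number of circuit edges
        that changed for that layer between checkpoints.
    total_budget: sum of ranks to distribute (keeps the parameter
        count comparable to a uniform-rank LoRA baseline).
    min_rank: floor so every adapted layer keeps some capacity.
    """
    total_change = sum(edge_changes.values()) or 1
    ranks = {}
    for layer, change in edge_changes.items():
        share = change / total_change
        ranks[layer] = max(min_rank, round(share * total_budget))
    return ranks

# Example: a 4-layer model with a budget matching uniform rank 8 (4 * 8 = 32).
edge_changes = {"layers.0": 5, "layers.1": 30, "layers.2": 45, "layers.3": 20}
print(allocate_lora_ranks(edge_changes, total_budget=32))
# {'layers.0': 2, 'layers.1': 10, 'layers.2': 14, 'layers.3': 6}
```

Under this kind of scheme, layers whose circuit edges changed most receive higher-rank adapters, while the overall parameter count stays close to the uniform-rank baseline.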
Lay Summary: Fine-tuning large language models often boosts their performance, but it is unclear how their internal workings adapt to new tasks. In our work, we use circuit analysis to study how a model's internal circuits change during fine-tuning, aiming to explain why fine-tuning improves an LLM's accuracy. We find that most of the key components stay intact, but the strength and pattern of their connections change significantly. Leveraging this insight, we introduce a circuit-aware Low-Rank Adaptation (LoRA) method that devotes extra capacity to the most altered connections, achieving better accuracy with fewer added parameters. We further demonstrate that merging the connection patterns of simpler subtasks can effectively handle compositional tasks without re-identifying a circuit for the composite task. By revealing and harnessing these hidden pathways, our method makes fine-tuning faster, more efficient, and more transparent. We hope this clearer view into the “black box” will lead to more trustworthy and adaptable AI systems.
Link To Code: https://github.com/Xu0615/FinetuneCircuits
Primary Area: Deep Learning
Keywords: Circuit Analysis, Fine-Tuning, Mechanistic Interpretability
Submission Number: 4730