Theoretical Insights into Fine-Tuning Attention Mechanism: Generalization and Optimization

26 Sept 2024 (modified: 05 Feb 2025) · Submitted to ICLR 2025 · CC BY 4.0
Keywords: Large Language Models, Attention mechanism, Fine-tuning, Generalization and Optimization
TL;DR: This paper explores fine-tuning strategies for the attention mechanism from a theoretical perspective, focusing on generalization and optimization issues, and provides theoretical insights to guide algorithm design.
Abstract: Large Language Models (LLMs), built on Transformer architectures, exhibit remarkable generalization across a wide range of tasks. However, fine-tuning these models for specific tasks remains resource-intensive due to their extensive parameterization. In this paper, we investigate two notable phenomena related to the attention mechanism during the fine-tuning of LLMs. The first phenomenon, termed “Unequal Importance of Attention Matrices,” highlights the impact of fine-tuning different weight matrices: optimizing the $\mathbf{W}_v$ matrix yields significantly better performance than optimizing the $\mathbf{W}_k$ matrix, and fine-tuning only the $\mathbf{W}_q$ and $\mathbf{W}_v$ matrices is computationally efficient while delivering results comparable to, or even better than, fine-tuning all three matrices ($\mathbf{W}_q$, $\mathbf{W}_k$, and $\mathbf{W}_v$). The second phenomenon, “Attention Matrices with Customized Learning Rates Lead to Better Convergence,” emphasizes the importance of assigning distinct learning rates to these matrices. Specifically, a higher learning rate for the $\mathbf{W}_v$ matrix than for $\mathbf{W}_q$ and $\mathbf{W}_k$ accelerates convergence and improves performance. Building on these insights, we propose a new strategy that improves fine-tuning efficiency in terms of both storage and time. Experimental results on benchmark datasets validate the effectiveness of this approach, supporting our theoretical findings. Our analysis lays the theoretical groundwork for configuring and improving lightweight algorithms in LLM fine-tuning.
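The following is a minimal sketch, not the paper's implementation, of how the two observations in the abstract could be combined in practice: freeze $\mathbf{W}_k$, fine-tune only $\mathbf{W}_q$ and $\mathbf{W}_v$, and give $\mathbf{W}_v$ a larger learning rate. The module names `q_proj`, `k_proj`, `v_proj` and the learning-rate values are assumptions based on common HuggingFace-style attention layers, not details taken from the submission.

```python
# Sketch: selective fine-tuning of attention projections with per-matrix learning rates.
# Assumes a model whose attention projections are named "q_proj", "k_proj", "v_proj";
# adapt the name matching and learning rates to your own setup.
import torch

def build_optimizer(model, lr_q=1e-5, lr_v=1e-4):
    q_params, v_params = [], []
    for name, param in model.named_parameters():
        if "q_proj" in name:
            q_params.append(param)        # W_q: fine-tuned with the lower learning rate
        elif "v_proj" in name:
            v_params.append(param)        # W_v: fine-tuned with the higher learning rate
        else:
            param.requires_grad = False   # W_k and all other weights stay frozen
    return torch.optim.AdamW([
        {"params": q_params, "lr": lr_q},
        {"params": v_params, "lr": lr_v},
    ])
```

Grouping parameters this way lets a single optimizer apply distinct learning rates to $\mathbf{W}_q$ and $\mathbf{W}_v$ while skipping gradients for the frozen matrices, which is where the storage and time savings described in the abstract would come from.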
Primary Area: foundation or frontier models, including LLMs
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 6899
