Spectral Dynamics of Low-Rank Adaptation: Rank Monotonicity, Implicit Bias, and Optimal Rank Selection in LoRA and Tucker Fine-Tuning
Keywords: Low-Rank Adaptation (LoRA), Parameter-Efficient Fine-Tuning (PEFT), Spectral Dynamics, Implicit Bias, Rank Selection, Tucker Decomposition
TL;DR: This paper characterizes the spectral dynamics of LoRA training, proving that overparameterization is harmless; the analysis yields a zero-cost method for optimal rank selection and a highly parameter-efficient Tucker tensor extension.
Abstract: Low-Rank Adaptation (LoRA) has become a dominant paradigm for parameter-efficient fine-tuning of large language models, yet its theoretical underpinnings remain incompletely understood. We establish a precise characterization of the spectral dynamics of LoRA training: under gradient flow on the bilinear factorization $\Delta W = AB$, the singular values of the learned weight update $M(t) = A(t)B(t)$ evolve approximately as $\sigma_i(M(t)) \approx \lambda_i \tanh^2 \left( \frac{\lambda_i t}{\sqrt{2}} \right)$, growing in strictly decreasing order of $\lambda_i$ --- the singular values of the oracle update $\Delta W^*$. Three practically important consequences follow: (i) overparameterized LoRA (rank $r > r^*$) is provably benign --- extra singular values converge to zero; (ii) an optimal rank selection rule $\hat{r} = \max\{r : \sigma_r(G_0)^2 \geq C/n\}$ can be computed cheaply from the pre-finetuning gradient spectrum $G_0 = \nabla_W \mathcal{L}(W_0)$, requiring no auxiliary training; and (iii) these results extend to Tucker-LoRA, a multilinear generalization that adapts weight tensors via Tucker decompositions, achieving asymptotically superior parameter efficiency for tensor-structured weights. We validate all theoretical predictions on BERT-base (GLUE), Llama-3-8B (MT-Bench), and ViT-B/16 (CIFAR-100), finding that our spectral rank selection rule matches or outperforms AdaLoRA (Zhang et al., 2023) while requiring zero additional training overhead.
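The rank selection rule from the abstract, $\hat{r} = \max\{r : \sigma_r(G_0)^2 \geq C/n\}$, can be sketched in a few lines of NumPy: take the singular values of the pre-finetuning gradient and count how many exceed the threshold. This is an illustrative sketch only; the constant `C` and sample count `n` below are hypothetical placeholders, not values prescribed by the paper.

```python
import numpy as np

def spectral_rank(G0: np.ndarray, n: int, C: float = 1.0) -> int:
    """Largest r with sigma_r(G0)^2 >= C/n, per the paper's selection rule.

    C is a placeholder constant here; the paper leaves its choice to the method.
    """
    # Singular values come back in descending order, so the count of values
    # passing the threshold equals the largest qualifying index r.
    sigmas = np.linalg.svd(G0, compute_uv=False)
    return int(np.sum(sigmas ** 2 >= C / n))

# Toy example: a 64x32 "gradient" with exactly three dominant directions.
rng = np.random.default_rng(0)
U = np.linalg.qr(rng.standard_normal((64, 64)))[0]
V = np.linalg.qr(rng.standard_normal((32, 32)))[0]
S = np.zeros((64, 32))
S[[0, 1, 2], [0, 1, 2]] = [3.0, 2.0, 1.0]
G0 = U @ S @ V.T

print(spectral_rank(G0, n=1000))
```

With this spectrum the three planted singular values (3, 2, 1) all clear the threshold $C/n = 10^{-3}$ after squaring, while the remaining singular values are numerically zero, so the rule selects rank 3.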
Submission Number: 34