LoRA Training Provably Converges to a Low-Rank Global Minimum Or It Fails Loudly (But it Probably Won't Fail)
TL;DR: LoRA training works because there is a global minimizer near initialization and spurious local minima are far away.
Abstract: Low-rank adaptation (LoRA) has become a standard approach for fine-tuning large foundation models. However, our theoretical understanding of LoRA remains limited, as prior analyses of LoRA's training dynamics either rely on linearization arguments or consider highly simplified setups. In this work, we analyze the LoRA loss landscape without such restrictive assumptions. We define two regimes: a "special regime", which includes idealized setups where linearization arguments hold, and a "generic regime" representing more realistic setups where linearization arguments do not hold. In the generic regime, we show that LoRA training converges either to a global minimizer with low rank and small magnitude, or to a qualitatively distinct solution with high rank and large magnitude. Finally, we argue that the zero-initialization and weight decay in LoRA training induce an implicit bias toward the low-rank, small-magnitude region of the parameter space—where global minima lie—thus shedding light on why LoRA training usually succeeds in finding global minima.
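The mechanisms the abstract refers to can be illustrated with a minimal sketch (not code from the paper): LoRA parametrizes the weight update as a product B @ A of two low-rank factors, with B initialized to zero, and weight decay penalizes the magnitude of both factors. All dimensions, learning rates, and gradients below are illustrative placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

d_out, d_in, r = 8, 8, 2                  # illustrative dimensions; rank r << d
W0 = rng.standard_normal((d_out, d_in))   # frozen pretrained weight

# LoRA parametrizes the update as B @ A. With B zero-initialized,
# the adapted weight equals W0 exactly at initialization.
A = 0.01 * rng.standard_normal((r, d_in))
B = np.zeros((d_out, r))

def adapted_weight(W0, B, A):
    """Effective weight seen by the forward pass: frozen base plus low-rank update."""
    return W0 + B @ A

# Zero-initialization: the model is unchanged at the start of fine-tuning.
assert np.allclose(adapted_weight(W0, B, A), W0)
# The update is at most rank r by construction.
assert np.linalg.matrix_rank(B @ A) <= r

# One decoupled-weight-decay (AdamW-style) step on A and B with placeholder
# gradients; the decay term shrinks ||A|| and ||B||, biasing training toward
# the small-magnitude, low-rank region where the abstract locates global minima.
lr, wd = 1e-2, 1e-2
grad_A = rng.standard_normal(A.shape)     # placeholder gradient
grad_B = rng.standard_normal(B.shape)     # placeholder gradient
A = (1 - lr * wd) * A - lr * grad_A
B = (1 - lr * wd) * B - lr * grad_B
```

This sketch only shows the parametrization and update rule; the paper's contribution concerns the loss landscape these ingredients induce, not the mechanics themselves.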
Lay Summary: Most modern AI models are developed in two stages. First, a model learns general knowledge about a broad range of topics. Then, it is slightly adjusted to specialize in a specific task or domain. The most commonly used method for efficiently doing this second-stage adjustment is called LoRA (Low-Rank Adaptation).
These two learning stages are both essential, yet they differ in significant ways. While there is extensive research on algorithms for the initial training phase, the second, task-specific adaptation stage has received comparatively little theoretical attention.
Our paper addresses a specific question: can these algorithms always find the best solution for a given task? A body of prior research proves that they do, but those results rely on oversimplified assumptions about how complex AI models work. We show that once these assumptions are lifted, 'bad' solutions can also exist, but the conventional training recipe for LoRA carries a strong bias toward the good solutions, so training will 'probably not' end up at the bad ones.
This improved theoretical understanding helps ensure more reliable and efficient use of AI across various specialized tasks.
Primary Area: Theory->Deep Learning
Keywords: Low-rank adaptation, LoRA, deep learning theory, non-convex optimization, large language models, fine-tuning, post training
Submission Number: 5477