Keywords: fine-tuning, low rank adaptation, max margin solution
Abstract: Low-Rank Adaptation (LoRA) has become the de facto parameter-efficient fine-tuning algorithm.
Besides training efficiency, practitioners observe two striking benefits: *(i)* remarkable resistance to *catastrophic forgetting*, and *(ii)* the ability to *merge* independently trained adapters into a single model that performs well on multiple tasks. Despite their practical importance, these phenomena have lacked a rigorous theoretical explanation. In this work, we provide the first theoretical justification for both phenomena by analyzing the structure of LoRA solutions in multiclass linear classification problems with orthogonal tasks.
Our analysis shows that, under suitable weight regularization, the optimal LoRA adapter aligns exactly with the *max-margin* (hard-margin SVM) solution for the fine-tuning data. This alignment lets us track in closed form how the normalized margins on the pre-training data, the fine-tuning data, and their union vary with the regularization parameter.
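As a brief sketch of this claim (the notation $W_0$ for the pre-trained weights, $BA$ for the rank-$r$ adapter, and $\lambda$ for the regularization strength is assumed here for illustration, not taken from the paper body), the regularized fine-tuning objective and the alignment it induces can be written as
$$\min_{A,B}\; \mathcal{L}_{\mathrm{ft}}\!\left(W_0 + BA\right) + \lambda\,\|BA\|_F^2, \qquad \frac{BA}{\|BA\|_F} \;=\; \frac{W_{\mathrm{svm}}}{\|W_{\mathrm{svm}}\|_F},$$
where $W_{\mathrm{svm}}$ denotes the minimum-norm weight matrix that classifies every fine-tuning example with margin at least $1$ (the hard-margin SVM solution).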
For *(i)*, we identify a trade-off: decreasing the regularization parameter enlarges the fine-tuning margin while proportionally shrinking the pre-training margin, without ever collapsing it to zero.
For *(ii)*, we view the merged weights through the same margin lens: we prove why merging succeeds and derive the optimal mixing coefficients that maximize the margin on the union of all tasks.
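A hedged sketch of this merging view (the mixing coefficients $\alpha_t$, per-task adapters $B_tA_t$, and per-task normalized margins $\gamma_t$ are illustrative notation, not the paper's exact formulation):
$$W_{\mathrm{merged}}(\alpha) \;=\; W_0 + \sum_{t=1}^{T}\alpha_t\,B_tA_t, \qquad \alpha^\star \;\in\; \arg\max_{\alpha}\;\min_{1\le t\le T}\;\gamma_t\!\left(W_{\mathrm{merged}}(\alpha)\right),$$
i.e., the optimal coefficients maximize the smallest normalized margin across tasks, which is precisely the normalized margin on the union of all task data.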
Finally, we numerically validate our theory across multiple deep learning architectures and task configurations. The empirical results closely match our theoretical predictions.
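To make the quantities involved concrete, the following is a minimal, hypothetical NumPy sketch of merging independently trained LoRA adapters into a single linear classifier and measuring its normalized multiclass margin; the function names, shapes, and implementation details are illustrative assumptions, not the paper's actual experimental code.

```python
import numpy as np

def merge_lora(W0, adapters, alphas):
    """Merge LoRA adapters: W0 + sum_t alpha_t * (B_t @ A_t).

    W0 has shape (num_classes, d); each (B, A) pair satisfies (B @ A).shape == W0.shape.
    """
    W = W0.copy()
    for (B, A), alpha in zip(adapters, alphas):
        W = W + alpha * (B @ A)
    return W

def normalized_margin(W, X, y):
    """Normalized multiclass margin of a linear classifier W on data (X, y):
    the smallest gap between the true-class score and the best wrong-class
    score, divided by the Frobenius norm of W.
    """
    scores = X @ W.T                          # shape (n, num_classes)
    n = X.shape[0]
    true_scores = scores[np.arange(n), y].copy()
    scores[np.arange(n), y] = -np.inf         # mask the true class to find the runner-up
    runner_up = scores.max(axis=1)
    return (true_scores - runner_up).min() / np.linalg.norm(W)
```

The margin on the union of tasks can then be read off by calling `normalized_margin` on the concatenated task data.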
Taken together, our results give the first principled explanation for LoRA’s resistance to forgetting and its surprising merging ability.
Primary Area: optimization
Submission Number: 10259