From Low Rank Gradient Subspace Stabilization to Low-Rank Weights: Observations, Theories, and Applications
TL;DR: Our work provides a theoretical perspective on gradient subspace stabilization and presents WeLore, a novel framework for LLM compression and memory-efficient fine-tuning.
Abstract: The weight matrices of Large Language Models (LLMs) can often be expressed in low-rank format, with the potential to relax memory and compute resource requirements. Unlike previous works that pivot around developing novel matrix decomposition algorithms, in this work we study the emergent non-uniform low-rank properties across weight matrices in LLMs through the lens of stabilizing gradient subspaces. \textit{Firstly,} we provide a theoretical framework for understanding the stabilization of gradient subspaces through Hessian analysis. \textit{Secondly,} we empirically establish a consequential relationship between gradient dynamics and the low-rank expressiveness of weight matrices. Our findings reveal that different LLM components exhibit varying levels of converged low-rank structure, necessitating non-uniform rank reduction across them to minimize the performance drop due to compression. In view of that, we present \textit{Weight Low-Rank Projection} \textbf{(WeLore)}, which unifies weight compression and memory-efficient fine-tuning as ONE, in a data-agnostic and one-shot manner. Going beyond a mere compression technique, WeLore categorizes weight matrices into Low-rank Components (LRCs) and Non-Low-rank Components (N-LRCs) based on their ability to be expressed in low rank. Our gradient-dynamics perspective illustrates that \textit{LRCs tend to have better fine-tuning capabilities}, and that fine-tuning them alone can closely mimic (and sometimes outperform) the training-loss trajectory and performance of full fine-tuning, with a notable reduction in memory and compute footprint. All code and checkpoints will be released.
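For intuition, here is a minimal, hypothetical sketch (not the released WeLore implementation; see the repository linked below) of how weight matrices could be split into LRCs and N-LRCs by their SVD reconstruction error. The `rank_ratio` and `error_threshold` knobs are illustrative assumptions, not the paper's settings.

```python
import torch

def lowrank_error(W: torch.Tensor, rank: int) -> float:
    """Relative Frobenius error of the best rank-`rank` approximation of W (via SVD)."""
    Wf = W.detach().float()
    U, S, Vh = torch.linalg.svd(Wf, full_matrices=False)
    W_r = (U[:, :rank] * S[:rank]) @ Vh[:rank, :]
    return (torch.linalg.norm(Wf - W_r) / torch.linalg.norm(Wf)).item()

def categorize_weights(named_weights, rank_ratio=0.25, error_threshold=0.1):
    """Split 2-D weight matrices into LRCs (well captured at a reduced rank)
    and N-LRCs (poorly captured), based on SVD reconstruction error.
    Thresholds here are illustrative, not the values used in the paper."""
    lrcs, n_lrcs = [], []
    for name, W in named_weights:
        if W.dim() != 2:
            continue  # skip embeddings, norms, and bias vectors
        r = max(1, int(rank_ratio * min(W.shape)))
        err = lowrank_error(W, r)
        (lrcs if err < error_threshold else n_lrcs).append((name, err))
    return lrcs, n_lrcs

# Usage sketch: lrcs, n_lrcs = categorize_weights(model.named_parameters())
```

In this reading, LRCs would then be stored (and fine-tuned) in factored low-rank form, while N-LRCs are kept at full rank and frozen.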
Lay Summary: Large Language Models (LLMs) are extremely powerful but require enormous memory and computing resources, making them expensive and difficult to deploy widely. While researchers know these models can sometimes be compressed by representing their huge weight matrices in a simpler, “low-rank” form, it has not been clear why this low-rank structure emerges or how best to exploit it for compression without hurting performance. We discovered that different parts of LLMs naturally develop varying degrees of low-rank structure during training, closely tied to how the model’s gradients (the signals guiding learning) stabilize over time. Building on this, we created WeLore: a method that automatically analyzes each part of a pre-trained LLM to decide how much it can be compressed, focusing on those components that are truly low-rank. WeLore then compresses these parts and, when fine-tuning the model for new tasks, updates only the most compressible components, saving memory and computation.
Link To Code: https://github.com/VITA-Group/WeLore
Primary Area: Deep Learning->Foundation Models
Keywords: Large language models, Gradient subspace, Memory-efficient training, Optimization, Low Rank Compression
Submission Number: 13235