Subspace Optimization for Large Language Models with Convergence Guarantees

Published: 01 May 2025, Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: This paper shows the non-convergence of GaLore and proposes variants with convergence guarantees.
Abstract: Subspace optimization algorithms, such as GaLore (Zhao et al., 2024), have gained attention for pre-training and fine-tuning large language models (LLMs) due to their memory efficiency. However, their convergence guarantees remain unclear, particularly in stochastic settings. In this paper, we reveal that GaLore does not always converge to the optimal solution and provide an explicit counterexample to support this finding. We further explore the conditions under which GaLore achieves convergence, showing that it does so when either (i) a sufficiently large mini-batch size is used or (ii) the gradient noise is isotropic. More significantly, we introduce **GoLore** (**G**radient rand**o**m **Lo**w-**r**ank proj**e**ction), a novel variant of GaLore that provably converges in typical stochastic settings, even with standard batch sizes. Our convergence analysis extends naturally to other subspace optimization algorithms. Finally, we empirically validate our theoretical results and thoroughly test the proposed mechanisms. Code is available at https://github.com/pkumelon/Golore.
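The key algorithmic difference described in the abstract is how the low-rank projection matrix is chosen: GaLore derives it from an SVD of the current gradient, while GoLore draws it at random, independent of the gradient noise. The following is a minimal sketch of that contrast, not the authors' implementation; the function names, the use of plain SGD in the subspace, and the `rank`/`lr` parameters are illustrative assumptions.

```python
import torch

def galore_projector(grad: torch.Tensor, rank: int) -> torch.Tensor:
    # GaLore-style projection: top-`rank` left singular vectors of the stochastic gradient.
    U, _, _ = torch.linalg.svd(grad, full_matrices=False)
    return U[:, :rank]                      # shape (m, rank)

def golore_projector(grad: torch.Tensor, rank: int) -> torch.Tensor:
    # GoLore-style projection: a random rank-`rank` orthonormal basis,
    # drawn independently of the gradient (and hence of its noise).
    gaussian = torch.randn(grad.shape[0], rank, device=grad.device)
    Q, _ = torch.linalg.qr(gaussian)        # orthonormal basis, shape (m, rank)
    return Q

def subspace_sgd_step(W: torch.Tensor, grad: torch.Tensor, P: torch.Tensor, lr: float):
    # Compress the gradient into the subspace, then map the update back to full rank.
    low_rank_grad = P.T @ grad              # (rank, n): only low-rank state is kept
    W -= lr * (P @ low_rank_grad)
```

In the actual methods the low-rank gradient would feed a stateful optimizer (e.g., Adam) whose moments live in the subspace, which is where the memory savings come from; the plain SGD step above is only meant to show where the projection enters.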
Lay Summary: Training large AI models like ChatGPT requires significant computing power, making the process expensive and energy-intensive. GaLore, a recent method, helps reduce memory usage during training, making it more efficient. However, GaLore sometimes fails to produce strong results, especially when training with the small batch sizes typically used in practice. This is due to its sensitivity to the random noise present in the training process. To better understand this issue, we analyzed GaLore’s performance and found that while large batch sizes can help, they aren’t always practical due to hardware limits. To overcome this, we developed GoLore, an improved training method that uses random projections to reduce the impact of noise. GoLore keeps the memory efficiency of GaLore while achieving more stable and accurate training, even with small batches. Our work identifies a key limitation of GaLore and offers a practical solution. GoLore makes it easier to train large models efficiently and reliably, helping lower the barriers to building powerful AI systems.
Link To Code: https://github.com/pkumelon/Golore
Primary Area: Optimization
Keywords: Large Language Models, Memory-Efficient Training, Gradient Projection
Submission Number: 9646