GALE: Gradient Activation Low-rank Extraction for Fast, Memory-Efficient Large Language Model Training
Keywords: Memory-Efficient Training, Large Language Models, Gradient Projection, Low-Rank Approximation, Randomized Numerical Linear Algebra
TL;DR: GALE dramatically accelerates memory-efficient LLM training by replacing the slow SVD used in gradient projection with a fast randomized QR algorithm, achieving up to a 23x speedup in the update step without sacrificing performance.
Abstract: Training large language models (LLMs) is resource-intensive, with optimizer states consuming a significant portion of GPU memory. While system-level optimizations like ZeRO and new hardware have mitigated this at scale, memory constraints remain a critical hurdle for democratizing LLM training on consumer-grade hardware and maximizing efficiency on constrained clusters. Current memory-saving strategies involve a trade-off: Parameter-Efficient Fine-Tuning (PEFT) methods are fast and memory-efficient but often underperform full-parameter training. Conversely, gradient projection methods like GaLore enable full-parameter learning with a low memory footprint, though the computational cost of Singular Value Decomposition (SVD) remains a bottleneck. While recent works have attempted to mitigate this cost, we introduce \textbf{Gradient Activation Low-rank Extraction (GALE)}, a method that advances this line of optimization further. GALE re-engineers the gradient projection pipeline: instead of SVD, it uses a randomized sketching plus QR decomposition algorithm. This eliminates the key computational bottleneck, accelerating the low-rank optimizer update step by up to 23$\times$ over GaLore. Removing the overhead of gradient projection yields modest gains in overall training throughput. We present GALE in several variants, including an optimized version using mixed-precision fused kernels, which both improves throughput and boosts final task performance. When pre-training LLaMA models on the C4 dataset, GALE matches GaLore's task performance while consistently and substantially outperforming PEFT methods. On the GLUE fine-tuning benchmark, GALE narrows the performance gap to leading PEFT techniques while removing GaLore's optimizer overhead, thereby achieving higher training throughput than prior gradient projection methods and making full-parameter fine-tuning more computationally practical.
By effectively balancing memory, performance, and computational speed, GALE sets a new practical frontier for efficient full-parameter LLM training. Code to replicate our findings can be found at \href{https://anonymous.4open.science/r/GALE/README.md}{GitHub}.
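The core algorithmic change described in the abstract — replacing the per-update SVD with a randomized sketch followed by a thin QR factorization — can be illustrated with a minimal NumPy sketch. This is a hypothetical re-implementation of the general technique (a randomized range finder), not the authors' code; function names, the oversampling parameter, and the projector shapes are illustrative assumptions.

```python
import numpy as np

def svd_projector(G, r):
    # GaLore-style projector: top-r left singular vectors of the
    # gradient matrix G, computed via a full (slow) SVD.
    U, _, _ = np.linalg.svd(G, full_matrices=False)
    return U[:, :r]

def sketch_qr_projector(G, r, oversample=8, seed=None):
    # Randomized range finder (hypothetical GALE-style projector):
    # multiply G by a random Gaussian test matrix to sketch its
    # column space, then orthonormalize the sketch with a thin QR.
    # This avoids the SVD entirely.
    rng = np.random.default_rng(seed)
    Omega = rng.standard_normal((G.shape[1], r + oversample))
    Y = G @ Omega            # sketch of the range of G
    Q, _ = np.linalg.qr(Y)   # orthonormal basis for the sketch
    return Q[:, :r]          # rank-r projector

# Usage: project a gradient into rank-r space for the optimizer
# update, then lift the result back to the full parameter shape.
rng = np.random.default_rng(0)
G = rng.standard_normal((1024, 256))   # toy gradient matrix
P = sketch_qr_projector(G, r=32, seed=0)
G_low = P.T @ G                        # compact (r x n) optimizer state
G_back = P @ G_low                     # lifted back to full shape
```

The speedup comes from the asymptotics: the sketch-and-QR path costs roughly O(mnr) for an m-by-n gradient at rank r, versus the much larger cost of a full SVD, while the oversampled Gaussian sketch keeps the captured subspace close to the dominant singular directions.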
Primary Area: optimization
Submission Number: 22566