GALE: Gradient Activation Low-rank Extraction for Fast, Memory-Efficient Large Language Model Training
Keywords: Memory-Efficient Training, Large Language Models, Gradient Projection, Low-Rank Approximation, Randomized Numerical Linear Algebra
TL;DR: GALE dramatically accelerates memory-efficient LLM training by replacing the slow SVD used in gradient projection with a fast randomized QR algorithm, achieving up to a 23x speedup in the update step without sacrificing performance.
Abstract: Training large language models (LLMs) is resource-intensive, with optimizer states consuming a significant portion of GPU memory. While system-level optimizations like ZeRO and new hardware have mitigated this at scale, memory constraints remain a critical hurdle for democratizing LLM training on consumer-grade hardware and maximizing efficiency on constrained clusters. Current memory-saving strategies involve a trade-off: Parameter-Efficient Fine-Tuning (PEFT) methods are fast and memory-efficient but often underperform full-parameter training. Conversely, gradient projection methods like GaLore enable full-parameter learning with a low memory footprint, though the computational cost of Singular Value Decomposition (SVD) remains a bottleneck. While recent works have attempted to mitigate this cost, we introduce \textbf{Gradient Activation Low-rank Extraction (GALE)}, a method that advances this line of optimization further. GALE re-engineers the gradient projection pipeline: instead of SVD, it uses a randomized sketching plus QR decomposition algorithm. This eliminates the key computational bottleneck, accelerating the low-rank optimizer update step by up to 23$\times$ over GaLore. Removing the overhead of gradient projection yields modest gains in overall training throughput. We present GALE in several variants, including an optimized version using mixed-precision fused kernels, which both improves throughput and boosts final task performance. When pre-training LLaMA models on the C4 dataset, GALE matches GaLore's task performance while consistently and substantially outperforming PEFT methods. On the GLUE fine-tuning benchmark, GALE narrows the performance gap to leading PEFT techniques while removing GaLore's optimizer overhead, thereby achieving higher training throughput than prior gradient projection methods and making full-parameter fine-tuning more computationally practical.
By effectively balancing memory, performance, and computational speed, GALE sets a new practical frontier for efficient full-parameter LLM training. Code to replicate our findings can be found at \href{https://anonymous.4open.science/r/GALE/README.md}{GitHub}.
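The core algorithmic change described in the abstract — replacing the per-update SVD with a randomized sketch followed by a thin QR factorization — can be illustrated with a minimal NumPy sketch. This is a hypothetical re-implementation of the general technique (a randomized range finder), not the authors' code; function names, the oversampling parameter, and the projector shapes are illustrative assumptions.

```python
import numpy as np

def svd_projector(G, r):
    # GaLore-style projector: top-r left singular vectors of the
    # gradient matrix G, computed via a full (slow) SVD.
    U, _, _ = np.linalg.svd(G, full_matrices=False)
    return U[:, :r]

def sketch_qr_projector(G, r, oversample=8, seed=None):
    # Randomized range finder (hypothetical GALE-style projector):
    # multiply G by a random Gaussian test matrix to sketch its
    # column space, then orthonormalize the sketch with a thin QR.
    # This avoids the SVD entirely.
    rng = np.random.default_rng(seed)
    Omega = rng.standard_normal((G.shape[1], r + oversample))
    Y = G @ Omega            # sketch of the range of G
    Q, _ = np.linalg.qr(Y)   # orthonormal basis for the sketch
    return Q[:, :r]          # rank-r projector

# Usage: project a gradient into rank-r space for the optimizer
# update, then lift the result back to the full parameter shape.
rng = np.random.default_rng(0)
G = rng.standard_normal((1024, 256))   # toy gradient matrix
P = sketch_qr_projector(G, r=32, seed=0)
G_low = P.T @ G                        # compact (r x n) optimizer state
G_back = P @ G_low                     # lifted back to full shape
```

The speedup comes from the asymptotics: the sketch-and-QR path costs roughly O(mnr) for an m-by-n gradient at rank r, versus the much larger cost of a full SVD, while the oversampled Gaussian sketch keeps the captured subspace close to the dominant singular directions.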
Primary Area: optimization
Submission Number: 22566