A Memory Efficient Randomized Subspace Optimization Method for Training Large Language Models

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: We propose a randomized subspace optimization framework for LLM training that reduces both activation and optimizer memory costs while ensuring convergence and achieving performance on par with GaLore and Adam.
Abstract: The memory challenges associated with training Large Language Models (LLMs) have become a critical concern, particularly when using the Adam optimizer. To address this issue, numerous memory-efficient techniques have been proposed, with GaLore standing out as a notable example designed to reduce the memory footprint of optimizer states. However, these approaches do not alleviate the memory burden imposed by activations, rendering them unsuitable for scenarios involving long-context sequences or large mini-batches. Moreover, their convergence properties are still not well understood in the literature. In this work, we introduce a Randomized Subspace Optimization framework for pre-training and fine-tuning LLMs. Our approach decomposes the high-dimensional training problem into a series of lower-dimensional subproblems. At each iteration, a random subspace is selected, and the parameters within that subspace are optimized. This structured reduction in dimensionality allows our method to simultaneously reduce memory usage for both activations and optimizer states. We establish comprehensive convergence guarantees and derive rates for various scenarios, accommodating different optimization strategies for solving the subproblems. Extensive experiments validate the superior memory and communication efficiency of our method, which achieves performance comparable to GaLore and Adam.
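
To make the abstract's description concrete, below is a minimal PyTorch-style sketch of one outer iteration of the randomized-subspace idea applied to a single weight matrix. This is not the authors' implementation: the function name `random_subspace_step` and the hyperparameters `subspace_rank`, `inner_steps`, and `lr` are illustrative assumptions, and the choice of a Gaussian-then-QR basis is one plausible instantiation of "selecting a random subspace."

```python
# Hypothetical sketch of randomized subspace optimization for one weight matrix.
# Names and hyperparameters are illustrative assumptions, not the paper's interface.
import torch

def random_subspace_step(W, loss_fn, subspace_rank=8, inner_steps=5, lr=1e-3):
    """One outer iteration: optimize W (m x n) inside a random subspace.

    W is reparameterized as W + P @ B, where P is a fixed random m x r basis
    and only the low-dimensional variable B (r x n) is trained.
    """
    m, n = W.shape
    # Select a random subspace basis (Gaussian draw, then orthonormalize).
    P = torch.linalg.qr(torch.randn(m, subspace_rank))[0]        # m x r
    B = torch.zeros(subspace_rank, n, requires_grad=True)        # r x n
    # Optimizer state (Adam moments) lives in r x n instead of m x n.
    opt = torch.optim.Adam([B], lr=lr)

    # Solve the lower-dimensional subproblem with a few inner steps.
    for _ in range(inner_steps):
        opt.zero_grad()
        loss = loss_fn(W + P @ B)   # forward pass through the subspace parameterization
        loss.backward()
        opt.step()

    # Merge the subspace update back into the full parameter; P and B are discarded.
    return W + (P @ B).detach()
```

In this sketch, resampling P at each outer iteration lets successive subproblems explore different directions, while Adam's moment estimates are only kept for the r x n variable B rather than the full m x n matrix, which illustrates the optimizer-state savings the abstract refers to.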
Lay Summary: Training large AI models like ChatGPT often requires huge amounts of computer memory, especially when using common methods like the Adam optimizer. While some recent techniques help reduce part of the memory load, they still struggle with other major sources of memory usage—like the data temporarily stored during training. These limitations make it hard to train models on long texts or with large batches of data. In our work, we propose a new training method that reduces memory usage more effectively. Instead of training the full model at once, we break the process into smaller, manageable parts and focus on a random piece of the model each time. This strategy saves memory in multiple ways and still achieves strong performance. We also provide mathematical proof that our method works reliably. Experiments show that our approach can match the performance of popular methods while using much less memory and communication.
Primary Area: Optimization->Stochastic
Keywords: randomized subspace optimization, large language model training, stochastic optimization
Submission Number: 9970