FRUGAL: Memory-Efficient Optimization by Reducing State Overhead for Scalable Training

Published: 01 May 2025, Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: We create a memory-efficient optimization framework that performs full-rank updates by combining advanced methods like Adam with state-free methods like signSGD.
Abstract: As the number of parameters in large language models grows, the training process demands increasingly large amounts of GPU memory. A significant portion of this memory is typically consumed by the optimizer state. To overcome this challenge, recent approaches such as low-rank adaptation (LoRA), low-rank gradient projection (GaLore), and blockwise optimization (BAdam) have been proposed. However, in all these algorithms, the *effective rank of the weight updates remains low*, which can lead to a substantial loss of information from the gradient. This loss can be critically important, especially during the pre-training stage. In this paper, we introduce FRUGAL (**F**ull-**R**ank **U**pdates with **G**r**A**dient sp**L**itting), a new memory-efficient optimization framework. FRUGAL leverages gradient splitting to perform low-dimensional updates using advanced algorithms (such as Adam), while updates along the remaining directions are executed via state-free methods like SGD or signSGD. Our framework can be integrated with various low-rank update selection techniques, including GaLore and BAdam. We provide theoretical convergence guarantees for our framework when using SGDM for low-dimensional updates and SGD for state-free updates. Additionally, our method consistently outperforms competing approaches, achieving state-of-the-art results in pre-training and fine-tuning tasks while balancing memory efficiency and performance metrics.
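To make the gradient-splitting idea concrete, below is a minimal, illustrative sketch of a FRUGAL-style update step, not the authors' implementation (see the linked code for that). It assumes a GaLore-like projection: the gradient of a weight matrix is split into a low-rank component, updated with Adam-style statistics, and a full-rank residual, updated state-free with signSGD. The function and state names (`frugal_step`, `proj`, `exp_avg`, `exp_avg_sq`, `lr_free`) are hypothetical.

```python
import torch

def frugal_step(W, G, state, rank=64, lr=1e-3, lr_free=1e-4,
                betas=(0.9, 0.999), eps=1e-8):
    """One FRUGAL-style step on a 2D weight matrix W with gradient G (sketch)."""
    # (Re)compute a projection onto the top-`rank` gradient subspace; Adam state
    # is kept only inside this small subspace, which is where the memory savings come from.
    if "proj" not in state:
        U, _, _ = torch.linalg.svd(G, full_matrices=False)
        state["proj"] = U[:, :rank]                                       # (m, rank)
        state["exp_avg"] = torch.zeros(rank, G.shape[1], device=G.device)
        state["exp_avg_sq"] = torch.zeros(rank, G.shape[1], device=G.device)
        state["step"] = 0
    P = state["proj"]

    # Split the gradient: low-rank part (in the subspace) and its residual.
    G_low = P.T @ G                   # (rank, n) projected gradient
    G_res = G - P @ G_low             # residual, orthogonal to the subspace

    # Adam-style update on the low-rank component.
    state["step"] += 1
    b1, b2 = betas
    state["exp_avg"].mul_(b1).add_(G_low, alpha=1 - b1)
    state["exp_avg_sq"].mul_(b2).addcmul_(G_low, G_low, value=1 - b2)
    m_hat = state["exp_avg"] / (1 - b1 ** state["step"])
    v_hat = state["exp_avg_sq"] / (1 - b2 ** state["step"])
    update_low = P @ (m_hat / (v_hat.sqrt() + eps))

    # State-free signSGD update on the remaining directions, so the overall
    # weight update stays full-rank without storing full-size optimizer state.
    update_res = torch.sign(G_res)

    W -= lr * update_low + lr_free * update_res
    return W
```

In this sketch the projection is recomputed only once; in practice the subspace would be refreshed periodically, and the state-free branch could use plain SGD instead of signSGD.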
Lay Summary: As language models grow larger, training them requires enormous amounts of computer memory. Much of this memory is used by the optimization algorithm that guides the learning process. To address this problem, researchers have developed techniques like LoRA, GaLore, and BAdam that reduce memory usage by limiting updates to only certain parts of the model. However, these methods have a key limitation: they only make *low-rank updates at each step*, meaning they lose important information from the learning signal. This information loss can hurt performance, especially when training models from scratch. In this paper, we introduce FRUGAL (**F**ull-**R**ank **U**pdates with **G**r**A**dient sp**L**itting), a new approach that solves this problem. Our method splits the learning signal into two parts: important directions get updated using sophisticated algorithms like Adam (the de facto optimization algorithm in deep learning), while the remaining directions use simpler, memory-efficient methods. This way, we keep all the information while still saving memory. We mathematically prove that our approach converges to good solutions. In experiments, our method consistently outperforms other methods on both training new models and adapting existing ones, achieving the best results while using memory efficiently.
Link To Code: https://anonymous.4open.science/r/FRUGAL-D3CA
Primary Area: Optimization->Large Scale, Parallel and Distributed
Keywords: Memory-efficient training, Optimization, Full-Rank Update, Large Language Models, LLM, Pre-training, Fine-tuning
Submission Number: 908