AdaPM: a Partial Momentum Algorithm for LLM Training

Yimu Zhang; Yuanshi Liu; Cong Fang

20 Sept 2025 (modified: 11 Feb 2026), Submitted to ICLR 2026, CC BY 4.0

Keywords: large language model, memory, optimizer

Abstract: In the training of large language models, momentum often yields significant acceleration and is widely used. However, storing momentum poses memory challenges. In this paper, we propose AdaPM, a training strategy that uses adaptive partial momentum to implement a memory-efficient optimizer. We first show that the momentum in transformer optimizers is highly redundant and demonstrate that most blocks do not require full momentum acceleration; we therefore assign different momentum designs to different blocks. We further improve the partial momentum with a bias-corrected approach based on the error-feedback technique. Empirically, we verify that our approach reduces momentum memory by over $90\%$ in the best case while maintaining both efficiency and performance for pretraining GPT-2 and LLaMA architectures ranging from 60M to 1.5B parameters, and for supervised fine-tuning and RLHF on the UltraFeedback dataset. AdaPM can further reduce optimizer-state memory by up to $90\%$ when combined with a memory-efficient technique for the second-order statistic, saving over $30\%$ of GPU hours for pretraining.

Primary Area: optimization

Submission Number: 23808
