Keywords: large language model, memory, optimizer
Abstract: In the training of large language models, momentum is widely used and often yields significant acceleration. However, storing momentum typically poses memory challenges. In this paper, we propose AdaPM, an adaptive training strategy that uses partial momentum to implement a memory-efficient optimizer. We first show that the momentum state in transformer optimizers is highly redundant and that most blocks do not require full momentum acceleration; we therefore assign different momentum designs to different blocks. We further improve partial momentum with a bias-correction approach based on the error-feedback technique. Empirically, we verify that our approach reduces momentum memory by over $90\%$ while maintaining both efficiency and performance for pretraining GPT-2 and LLaMA architectures ranging from 60M to 1.5B parameters, and for supervised fine-tuning and RLHF on the UltraFeedback dataset. Combined with a memory-efficient technique for the second-order statistic, AdaPM further reduces optimizer-state memory by up to $90\%$, saving over $30\%$ of GPU hours in pretraining.
Primary Area: optimization
Submission Number: 23808