IO-Adam: Rethinking Memory-Efficient Adaptive Optimizers from Gradient Computation

18 Sept 2025 (modified: 11 Feb 2026). Submitted to ICLR 2026. License: CC BY 4.0
Keywords: Optimizer; Low rank adaptation
Abstract: Adaptive Moment Estimation (Adam) is one of the most popular stochastic optimizers for training deep neural networks and has become the default optimizer in many scenarios, especially for language tasks. By estimating the first and second moments of the gradient, Adam provides an adaptive learning rate for each parameter, significantly outperforming Stochastic Gradient Descent (SGD). However, as deep neural networks grow larger, storing the first and second moment estimates takes up substantial memory, which motivates methods that reduce the memory usage of adaptive optimizers. In this paper, we rethink first and second moment estimation from the perspective of gradient computation. The gradient of a weight matrix is the product of the layer input and the gradient of the layer output. Instead of seeking a low-rank approximation of the first and second moment estimates, as in previous work, we propose to track the input and the output gradient for efficient moment estimation. We analyze the connections and differences between our proposed method, the widely used Adam optimizer, and previous memory-efficient optimizers. Experiments verify the effectiveness of our method: it reduces memory usage by up to 30% while matching or even improving on the performance of Adam.
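The gradient structure that the abstract relies on can be illustrated with a minimal sketch. It assumes a bias-free linear layer y = xW and is not the paper's algorithm; the helper names `outer` and `weight_grad` are hypothetical. It shows that the weight gradient is a sum of outer products of inputs and output gradients, and that for a single sample the elementwise square of that gradient factors in the same way.

```python
def outer(u, v):
    """Outer product u v^T as a nested list."""
    return [[ui * vj for vj in v] for ui in u]

def weight_grad(xs, gs):
    """Gradient of W for y = x W over a batch.

    xs: list of input vectors (length m each)
    gs: list of output-gradient vectors (length n each)
    Returns the m x n sum of per-sample outer products x^T g.
    """
    m, n = len(xs[0]), len(gs[0])
    grad = [[0.0] * n for _ in range(m)]
    for x, g in zip(xs, gs):
        for i, row in enumerate(outer(x, g)):
            for j, val in enumerate(row):
                grad[i][j] += val
    return grad

# Single-sample example with illustrative values (assumed, not from the paper).
x = [1.0, 2.0, 3.0]   # layer input, m = 3
g = [0.5, -1.0]       # gradient of the loss w.r.t. the layer output, n = 2

G = weight_grad([x], [g])

# Elementwise square of the gradient, as used by a second-moment estimate.
G_squared = [[v * v for v in row] for row in G]

# The same matrix built from the squared factors alone.
factored_squared = outer([xi * xi for xi in x], [gj * gj for gj in g])

assert G_squared == factored_squared

# Number of stored values: dense m x n matrix versus the two factors.
dense_entries = len(x) * len(g)
factor_entries = len(x) + len(g)
```

The exact factorization of the squared gradient holds only for a single sample; with a batch, the gradient is a sum of outer products and its elementwise square no longer factors exactly, which is where the choice of estimator matters.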
Primary Area: optimization
Submission Number: 11844