Abstract: As deep learning continues to spearhead transformative breakthroughs across various domains, the computational and memory demands for training state-of-the-art models have surged exponentially. This escalation not only challenges the scalability of deep learning systems but also significantly increases the financial cost of training. Memory-intensive operations, particularly during the optimization phase of training large models, can drastically inflate budgets, making cutting-edge research and applications less accessible. In response to this challenge, we introduce a novel technique termed gradient space reutilization, aimed at reducing memory usage in deep network training by repurposing the memory allocated for gradients once it is no longer needed in the remaining computation. This approach is implemented in modified versions of popular optimizers, named AdamW-R, Adan-R, and Lion-R, respectively, which deliver appreciable memory savings without compromising performance, since they are mathematically equivalent to the original algorithms. Extensive experiments demonstrate that our simple engineering trick can achieve memory savings of up to 25.60%, providing a practical solution for efficient resource management in deep learning training environments.
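To make the idea of gradient space reutilization concrete, the following is a minimal, hypothetical PyTorch sketch of a single AdamW-style step in which the gradient buffer is reused as scratch space once the moment estimates have been updated; the function name adamw_r_step and its signature are illustrative assumptions, not the paper's actual AdamW-R implementation.

```python
import torch


@torch.no_grad()
def adamw_r_step(param, exp_avg, exp_avg_sq, step, lr=1e-3,
                 betas=(0.9, 0.999), eps=1e-8, weight_decay=1e-2):
    """Hypothetical sketch: one AdamW-like update that reuses the
    gradient's memory instead of allocating fresh tensors."""
    grad = param.grad

    # The raw gradient is still needed here: update the moment estimates.
    exp_avg.mul_(betas[0]).add_(grad, alpha=1 - betas[0])
    exp_avg_sq.mul_(betas[1]).addcmul_(grad, grad, value=1 - betas[1])

    bias_c1 = 1 - betas[0] ** step
    bias_c2 = 1 - betas[1] ** step

    # From this point on the gradient is no longer needed, so its buffer
    # is overwritten in place with the bias-corrected denominator,
    # avoiding a separate temporary allocation.
    torch.sqrt(exp_avg_sq, out=grad)
    grad.div_(bias_c2 ** 0.5).add_(eps)

    # Decoupled weight decay, then the update using the reused buffer.
    param.mul_(1 - lr * weight_decay)
    param.addcdiv_(exp_avg, grad, value=-lr / bias_c1)
```

In this sketch the numerical result matches a standard AdamW step; only the placement of intermediate tensors changes, which is why the modified optimizers can remain equivalent to the originals while saving memory.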