Abstract: Training deep neural networks (DNNs) on memory-limited GPUs is challenging, as storing intermediate activations often exceeds available memory. Re-materialization, a technique that preserves exact computations, addresses this by selectively recomputing activations instead of storing them. However, existing methods either fail to scale, lack generality, or introduce excessive execution overhead. We introduce HiRemate, a $\textit{hierarchical}$ re-materialization framework that recursively partitions large computation graphs, applies optimized solvers at multiple levels, and merges solutions into a globally efficient training schedule. This enables scalability to significantly larger graphs than prior ILP-based methods while keeping runtime overhead low. Designed for single-GPU models and activation re-materialization, HiRemate extends the feasibility of training networks with thousands of graph nodes, surpassing prior methods in both efficiency and scalability. Experiments on various types of networks yield 50-70% memory reduction with only 10-15% overhead, closely matching optimal solutions while significantly reducing solver time. Seamlessly integrating with PyTorch Autograd, HiRemate requires almost no code changes to use, enabling broad adoption in memory-constrained deep learning.
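As a purely illustrative sketch (not the authors' implementation), the hierarchical idea described above can be pictured as recursively splitting a computation graph until each part is small enough for an exact solver, then merging the partial schedules; the solver and merge steps below are placeholders for the optimized (e.g., ILP-based) components:

```python
# Toy illustration of hierarchical partition-solve-merge (hypothetical names;
# the real HiRemate solver, partitioner, and schedule format are not shown here).

def solve_small(nodes):
    # Placeholder for an exact re-materialization solver (e.g., an ILP) on a
    # small subgraph: here we trivially decide to store every activation.
    return [("store", n) for n in nodes]

def merge(schedules):
    # Placeholder for stitching per-subgraph schedules into one global schedule.
    return [step for sched in schedules for step in sched]

def hierarchical_schedule(nodes, max_size=4):
    # Recursively partition the (linearized) graph until each part fits the
    # exact solver's size budget, then merge the partial solutions.
    if len(nodes) <= max_size:
        return solve_small(nodes)
    mid = len(nodes) // 2
    left = hierarchical_schedule(nodes[:mid], max_size)
    right = hierarchical_schedule(nodes[mid:], max_size)
    return merge([left, right])

if __name__ == "__main__":
    print(hierarchical_schedule(list(range(10))))
```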
Lay Summary: Training deep neural networks requires storing many intermediate results during the forward pass so they can be reused during backpropagation. Although the model’s weights may fit on a single GPU, the total memory needed for training can exceed the device’s capacity, largely due to the size of these intermediate values. One way to reduce memory usage is re-materialization, which selectively recomputes some of these values instead of storing them all. However, for large models, deciding what to recompute is a challenging problem.
We introduce HiRemate, a framework that tackles this problem in a hierarchical manner. The computation graph of the neural network is first divided into parts small enough to make the problem easy to solve. Thanks to our algorithm, these partial solutions are then merged—several times if necessary—until we obtain a complete solution for the entire graph. HiRemate is designed for models whose weights fit in GPU memory and focuses on reducing activation memory during training. It also supports re-materialization strategies from the literature, making it easy to combine different methods within a single framework.
We tested HiRemate on a range of common neural networks and consistently saw large memory savings with only a small increase in training time. This makes it easier to train modern deep learning models on limited hardware.
Primary Area: General Machine Learning->Hardware and Software
Keywords: Rematerialization, Checkpointing, Memory-Efficient Training, Neural Networks, PyTorch, Integer Linear Programming, Training
Submission Number: 16364