Keywords: deep learning, large language models
Abstract: Scaling neural networks to ever larger datasets and model sizes necessitates more efficient deep learning algorithms.
A significant challenge in neural network training is the memory footprint of activation tensors, particularly in pointwise nonlinearity layers, which traditionally save the entire input tensor for the backward pass.
In this paper, we propose a modification to the handling of activation tensors in pointwise nonlinearity layers.
Our method saves the output tensor instead of the input tensor during the forward pass. Since the subsequent layer typically also saves its input tensor, only one tensor is stored between layers instead of two, reducing the total memory required. This optimization is especially beneficial for transformer-based architectures such as GPT, BERT, Mistral, and Llama.
To enable this approach, we use the inverse function of the nonlinearity during the backward pass. Since the inverse cannot be computed analytically for most nonlinearities, we construct accurate approximations from simpler functions.
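In symbols (our notation, not taken verbatim from the submission): for a pointwise nonlinearity $y = f(x)$, the standard backward pass saves $x$ and computes
\[
\frac{\partial L}{\partial x} = \frac{\partial L}{\partial y} \odot f'(x),
\]
whereas the scheme described above saves only $y$ and recovers the derivative through an approximate inverse $\tilde{f}^{-1} \approx f^{-1}$:
\[
\frac{\partial L}{\partial x} \approx \frac{\partial L}{\partial y} \odot f'\!\left(\tilde{f}^{-1}(y)\right).
\]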
Experimental results demonstrate that our method significantly reduces memory usage without affecting training accuracy or computational performance.
Our implementation is provided as a drop-in replacement for standard nonlinearity layers in the PyTorch framework, facilitating easy adoption without requiring architectural modifications. The code is available at \url{https://github.com/removed/for/anonimity}.
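For illustration only, the following is a minimal sketch (not the authors' released implementation) of an output-saving nonlinearity as a custom PyTorch autograd function. LeakyReLU is chosen here because its inverse is exact; the submission's contribution is approximating the inverse for nonlinearities such as GELU, where no closed form exists. The class name and parameters are ours.

```python
import torch
import torch.nn.functional as F


class OutputSavingLeakyReLU(torch.autograd.Function):
    """Illustrative sketch of an output-saving pointwise nonlinearity.

    The forward pass stores the output tensor for backward, so the tensor
    kept between this layer and the next is shared rather than duplicated.
    For nonlinearities such as GELU, the exact inverse used below would be
    replaced by an accurate approximation.
    """

    @staticmethod
    def forward(ctx, x, negative_slope):
        y = F.leaky_relu(x, negative_slope)
        ctx.save_for_backward(y)          # save the output, not the input
        ctx.negative_slope = negative_slope
        return y

    @staticmethod
    def backward(ctx, grad_output):
        (y,) = ctx.saved_tensors
        slope = ctx.negative_slope
        # For a positive slope, sign(y) == sign(x), so the derivative of
        # LeakyReLU can be recovered from the saved output alone.
        deriv = torch.where(y >= 0, torch.ones_like(y), torch.full_like(y, slope))
        return grad_output * deriv, None  # no gradient for negative_slope


# Usage: behaves like F.leaky_relu, but together with the following layer
# only one activation tensor is retained for the backward pass.
x = torch.randn(4, 8, requires_grad=True)
y = OutputSavingLeakyReLU.apply(x, 0.01)
y.sum().backward()
```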
Primary Area: infrastructure, software libraries, hardware, systems, etc.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 9876