TL;DR: We explore a connection between linear gradient transformations and one-sided LoRA adapters, unifying existing approaches and deriving improved techniques for memory-efficient training.
Abstract: We study memory-efficient optimization of neural networks (in particular language models) with *linear gradient transformations*, where the gradients are linearly mapped to a lower dimensional space than the full parameter space, thus saving the memory required for gradient accumulation and optimizer state persistence. The model parameters are updated by first performing an optimization step in the lower dimensional space and then mapping back to the original parameter space via the linear map's transpose. We show that optimizing the model in this transformed space is equivalent to reparameterizing the original model through a *linear adapter* that additively modifies the model parameters, and then optimizing only the adapter's parameters. When the transformation is Kronecker-factored, this establishes an equivalence between GaLore and one-sided LoRA. We show that this duality between gradient transformations and adapter-based reparameterizations unifies existing approaches to memory-efficient training and suggests new techniques for improving training efficiency and memory use.
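As a concrete illustration of the stated duality, the following minimal NumPy sketch (our own construction, not taken from the paper; the matrix shapes, the fixed orthonormal projection `P`, and the use of plain SGD rather than a stateful optimizer are all assumptions made for clarity) compares (1) projecting the gradient to an r-dimensional space, taking the step there, and mapping it back through the transpose, against (2) reparameterizing the weights with a one-sided linear adapter `W = W0 + P @ A` and training only `A`.

```python
# Minimal sketch of the gradient-transformation / one-sided-adapter duality,
# assuming a fixed orthonormal projection P and plain SGD (not the paper's
# implementation). Shapes and the toy regression loss are illustrative only.
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r, lr = 8, 6, 3, 0.1

W0 = rng.normal(size=(d_out, d_in))               # frozen base weights
P = np.linalg.qr(rng.normal(size=(d_out, r)))[0]  # fixed projection, orthonormal columns
X = rng.normal(size=(d_in, 16))                   # a batch of inputs
Y = rng.normal(size=(d_out, 16))                  # regression targets

def grad_W(W):
    # gradient of 0.5 * ||W X - Y||_F^2 with respect to W
    return (W @ X - Y) @ X.T

# (1) Linear gradient transformation: take the step in the r-dimensional
#     space, then map it back to parameter space via P (the map's transpose
#     direction), GaLore-style but with plain SGD.
W = W0.copy()
for _ in range(5):
    g_low = P.T @ grad_W(W)   # project the gradient to the low-dimensional space
    W = W - lr * (P @ g_low)  # map the update back to the full parameter space

# (2) One-sided linear adapter: W = W0 + P @ A, with only A trainable.
A = np.zeros((r, d_in))
for _ in range(5):
    g_A = P.T @ grad_W(W0 + P @ A)  # chain rule: dL/dA = P^T dL/dW
    A = A - lr * g_A

print(np.allclose(W, W0 + P @ A))  # True: the two parameterizations coincide
```

With a fixed projection and SGD the two procedures produce identical weight iterates; running a stateful optimizer such as Adam in the low-dimensional space is the setting in which the abstract's GaLore / one-sided LoRA equivalence is stated.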
Lay Summary: One of the many aspects that make training large language models (LLMs) challenging is the large amount of memory required to do so. Many methods have been proposed to deal with memory limitations, among them low-rank adapters (e.g., LoRA) and gradient transformations (e.g., GaLore).
In this paper, we show that these two methods are, under certain conditions, mathematically equivalent. This allows us to take techniques that were previously applicable to only one of the methods and apply them to the other, yielding performance improvements.
Overall, our findings suggest ways to reduce the resources required to train large language models. For example, we found that in a decentralized, memory-constrained setting, one can train better LLMs by choosing the gradient transformation deliberately, so that each worker trains only one “slice” of the larger model.
Primary Area: Deep Learning->Large Language Models
Keywords: memory, efficient, training, gradient descent, pretraining, linear algebra, LLMs, GaLore, LoRA
Submission Number: 11631