Abstract: In this work, we study optimization methods that leverage the linear minimization oracle (LMO) over a norm-ball. We propose a new stochastic family of algorithms that uses the LMO to adapt to the geometry of the problem and, perhaps surprisingly, show that they can be applied to unconstrained problems. The resulting update rule unifies several existing optimization methods under a single framework. Furthermore, we propose an explicit choice of norm for deep architectures, which, as a side benefit, leads to the transferability of hyperparameters across model sizes. Experimentally, we demonstrate significant speedups on nanoGPT training without any reliance on Adam. The proposed method is memory-efficient, requiring only one set of model weights and one set of gradients, which can be stored in half-precision.
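To make the abstract's idea concrete, below is a minimal, hedged sketch of what an LMO-based update over a spectral-norm ball could look like, applied to an unconstrained weight matrix. This is an illustration under assumptions, not the paper's exact algorithm: the function and parameter names (`lmo_spectral`, `lmo_step`, `radius`, `momentum`) are invented for this example, and the spectral-norm choice is suggested only by the listed keywords.

```python
# Illustrative sketch only (assumptions, not the paper's released code):
# a stochastic update that averages gradients into a momentum buffer and
# then moves the weights along the LMO direction over a spectral-norm ball.
import torch


def lmo_spectral(g: torch.Tensor, radius: float = 1.0) -> torch.Tensor:
    """LMO over the spectral-norm ball of the given radius:
    argmin_{||X||_2 <= radius} <g, X> = -radius * U V^T, where g = U S V^T."""
    u, _, vh = torch.linalg.svd(g, full_matrices=False)
    return -radius * (u @ vh)


@torch.no_grad()
def lmo_step(weight: torch.Tensor, buf: torch.Tensor, grad: torch.Tensor,
             lr: float = 0.02, momentum: float = 0.9, radius: float = 1.0) -> None:
    """One hypothetical unconstrained step: update the gradient-averaging buffer,
    then add a scaled LMO direction to the weights."""
    buf.mul_(momentum).add_(grad, alpha=1.0 - momentum)
    weight.add_(lmo_spectral(buf, radius), alpha=lr)


if __name__ == "__main__":
    # Tiny usage example on a random weight matrix with stand-in gradients.
    torch.manual_seed(0)
    w = torch.randn(64, 32)
    state = torch.zeros_like(w)   # the single extra buffer mentioned in the abstract
    for _ in range(3):
        g = torch.randn_like(w)   # stand-in for a stochastic gradient
        lmo_step(w, state, g)
    print(f"weight norm after 3 steps: {w.norm().item():.3f}")
```

Note the memory footprint in this sketch matches the abstract's claim: besides the weights, only one gradient-sized buffer is kept, which could in principle be stored in half-precision.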
Lay Summary: Modern deep learning models are usually trained using algorithms that adapt during training to the structure of the data. In this work, we propose a new family of training methods that, instead, adapt in advance to the model’s structure—using mathematical tools that respect how neural networks are built. Our method leads to faster training, requires less memory, and avoids the need for commonly used algorithms like the Adam optimizer. It also allows settings like the learning rate to be reused across different model sizes, making it easier to scale up models. We demonstrate that our method can train large models more efficiently, including popular architectures like GPT and vision transformers.
Link To Code: https://github.com/LIONS-EPFL/scion
Primary Area: Deep Learning->Algorithms
Keywords: non-euclidean, linear minimization oracle, deep learning, spectral norm
Submission Number: 12949