Abstract: Distributed optimization has become the default training paradigm in modern machine learning due to the growing scale of models and datasets. To mitigate communication overhead, local updates are often applied before global aggregation, resulting in a nested optimization approach with inner and outer steps. However, heavy-tailed stochastic gradient noise remains a significant challenge, particularly in attention-based models, hindering effective training. In this work, we propose TailOPT, an efficient framework designed to address heavy-tailed noise by leveraging adaptive optimization and novel clipping techniques. We establish convergence guarantees for the TailOPT framework under heavy-tailed noise with local updates and potentially unbounded gradient variance.
Among its variants, we propose $Bi^2Clip$, a memory- and communication-efficient instantiation that performs coordinate-wise clipping from both above and below at both the inner and outer optimizers. $Bi^2Clip$ delivers the benefits of adaptive optimization (e.g., Adam) without the cost of maintaining or transmitting additional gradient statistics. Empirically, TailOPT, including $Bi^2Clip$, outperforms state-of-the-art methods across a range of tasks and models while being more efficient.
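To make the two-sided, coordinate-wise clipping idea concrete, here is a minimal sketch. The threshold names `lower` and `upper` are illustrative assumptions, not notation from the paper; in $Bi^2Clip$ an operator of this form would be applied to the updates of both the inner and the outer optimizer.

```python
import torch

def biclip(grad: torch.Tensor, lower: float, upper: float) -> torch.Tensor:
    """Clip each gradient coordinate's magnitude into [lower, upper],
    preserving its sign (hypothetical sketch of two-sided clipping)."""
    sign = torch.sign(grad)
    # Bound each coordinate's magnitude from below and above.
    mag = grad.abs().clamp(min=lower, max=upper)
    # Coordinates that were exactly zero have sign 0 and therefore stay zero.
    return sign * mag
```

Clipping from above tempers heavy-tailed noise, while clipping from below rescales small coordinates, mimicking the per-coordinate step-size adaptation of optimizers such as Adam without storing second-moment statistics.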
Lay Summary: As deep neural networks grow larger and more powerful, making them efficient and scalable becomes increasingly important, especially for models like transformers used in language and AI applications. A major bottleneck is the optimizer, the tool that guides how these models learn. Popular choices like the Adam optimizer use a lot of memory, which limits how big models can get or how quickly they can be trained. In this work, we introduce a new optimizer called **$BiClip$**, which cuts memory significantly compared with state-of-the-art approaches while still improving performance. We also extend this to distributed settings by proposing a general framework called **TailOPT**, and rigorously verify that our approach works using novel theoretical arguments. This makes training large models faster, cheaper, and more accessible without sacrificing quality.
Primary Area: Optimization
Keywords: distributed optimization, adaptive optimization, scalable algorithms
Submission Number: 3923