Track: Main paper track (up to 5 pages excluding references and appendix)
Keywords: Distributed Optimization, Adaptive Optimization, Scalable Algorithms
Abstract: Distributed optimization is essential for scaling modern machine learning, yet communication overhead remains a major bottleneck. Local updates reduce this cost but introduce a nested optimization structure in which heavy-tailed gradient noise, especially pronounced in attention-based models, impairs convergence. We propose TailOPT, a framework that leverages adaptive optimization and clipping to address heavy-tailed noise, with convergence guarantees under unbounded stochastic gradient variance and local updates. Among its variants, we introduce $Bi^2Clip$, which applies coordinate-wise clipping at both the inner and outer optimizers, achieving adaptive-like performance (e.g., Adam) without the overhead of maintaining or transmitting preconditioners. Empirically, TailOPT, including $Bi^2Clip$, outperforms state-of-the-art methods across multiple language tasks and models.
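To make the bi-level coordinate-wise clipping idea concrete, below is a minimal Python/NumPy sketch of clipping applied at both the inner (client) and outer (server) steps in a local-update setting. This is not the authors' implementation; the function names, the thresholds `tau_inner` and `tau_outer`, and the toy heavy-tailed objective are all illustrative assumptions.

```python
# Minimal sketch (not the paper's implementation) of coordinate-wise clipping
# applied at both the inner (client) and outer (server) optimizer steps.
# All names and thresholds here are illustrative assumptions.
import numpy as np

def coordwise_clip(v, tau):
    """Clip each coordinate of v to the interval [-tau, tau]."""
    return np.clip(v, -tau, tau)

def local_updates(params, grad_fn, steps, lr, tau_inner):
    """Inner loop: SGD with coordinate-wise clipping of each stochastic gradient."""
    x = params.copy()
    for _ in range(steps):
        g = grad_fn(x)                          # stochastic (possibly heavy-tailed) gradient
        x -= lr * coordwise_clip(g, tau_inner)  # inner clipping
    return x - params                           # client pseudo-gradient (model delta)

def outer_round(params, clients, steps, lr_in, lr_out, tau_inner, tau_outer):
    """Outer loop: average client deltas, clip coordinate-wise, then take the server step."""
    deltas = [local_updates(params, grad_fn, steps, lr_in, tau_inner) for grad_fn in clients]
    avg_delta = np.mean(deltas, axis=0)
    return params + lr_out * coordwise_clip(avg_delta, tau_outer)  # outer clipping

# Toy usage: two synthetic clients with heavy-tailed gradient noise on a quadratic objective.
target = np.array([1.0, -2.0, 0.5])
def make_client(seed):
    rng = np.random.default_rng(seed)
    # Student-t noise with small df gives heavy tails (unbounded variance for df <= 2).
    return lambda x: (x - target) + rng.standard_t(df=1.5, size=x.shape)
clients = [make_client(1), make_client(2)]

params = np.zeros(3)
for _ in range(200):
    params = outer_round(params, clients, steps=5, lr_in=0.05, lr_out=1.0,
                         tau_inner=1.0, tau_outer=1.0)
print("final parameters:", params)
```

Because only clipped model deltas are exchanged, no per-coordinate preconditioner state needs to be stored or transmitted, which is the communication advantage the abstract highlights.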
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Submission Number: 35