Track: Main paper track (up to 5 pages excluding references and appendix)
Keywords: Distributed Optimization, Adaptive Optimization, Scalable Algorithms
Abstract: Distributed optimization is essential for scaling modern machine learning, yet communication overhead remains a major bottleneck. Local updates reduce this cost but introduce a nested optimization structure in which heavy-tailed gradient noise, especially pronounced in attention-based models, impairs convergence. We propose TailOPT, a framework that leverages adaptive optimization and clipping to address heavy-tailed noise, with convergence guarantees under unbounded stochastic gradient variance and local updates. Among its variants, we introduce $Bi^2Clip$, which applies coordinate-wise clipping at both the inner and outer optimizers, achieving adaptive-like performance (e.g., Adam) without the overhead of maintaining or transmitting preconditioners. Empirically, TailOPT, including $Bi^2Clip$, outperforms state-of-the-art methods across multiple language tasks and models.
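To make the bi-level coordinate-wise clipping idea concrete, below is a minimal Python/NumPy sketch of clipping applied at both the inner (client) and outer (server) steps in a local-update setting. This is not the authors' implementation; the function names, the thresholds `tau_inner` and `tau_outer`, and the toy heavy-tailed objective are all illustrative assumptions.

```python
# Minimal sketch (not the paper's implementation) of coordinate-wise clipping
# applied at both the inner (client) and outer (server) optimizer steps.
# All names and thresholds here are illustrative assumptions.
import numpy as np

def coordwise_clip(v, tau):
    """Clip each coordinate of v to the interval [-tau, tau]."""
    return np.clip(v, -tau, tau)

def local_updates(params, grad_fn, steps, lr, tau_inner):
    """Inner loop: SGD with coordinate-wise clipping of each stochastic gradient."""
    x = params.copy()
    for _ in range(steps):
        g = grad_fn(x)                          # stochastic (possibly heavy-tailed) gradient
        x -= lr * coordwise_clip(g, tau_inner)  # inner clipping
    return x - params                           # client pseudo-gradient (model delta)

def outer_round(params, clients, steps, lr_in, lr_out, tau_inner, tau_outer):
    """Outer loop: average client deltas, clip coordinate-wise, then take the server step."""
    deltas = [local_updates(params, grad_fn, steps, lr_in, tau_inner) for grad_fn in clients]
    avg_delta = np.mean(deltas, axis=0)
    return params + lr_out * coordwise_clip(avg_delta, tau_outer)  # outer clipping

# Toy usage: two synthetic clients with heavy-tailed gradient noise on a quadratic objective.
target = np.array([1.0, -2.0, 0.5])
def make_client(seed):
    rng = np.random.default_rng(seed)
    # Student-t noise with small df gives heavy tails (unbounded variance for df <= 2).
    return lambda x: (x - target) + rng.standard_t(df=1.5, size=x.shape)
clients = [make_client(1), make_client(2)]

params = np.zeros(3)
for _ in range(200):
    params = outer_round(params, clients, steps=5, lr_in=0.05, lr_out=1.0,
                         tau_inner=1.0, tau_outer=1.0)
print("final parameters:", params)
```

Because only clipped model deltas are exchanged, no per-coordinate preconditioner state needs to be stored or transmitted, which is the communication advantage the abstract highlights.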
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Submission Number: 35