FEDERATED COMPOSITIONAL OPTIMIZATION: THE IMPACT OF TWO-SIDED LEARNING RATES ON COMMUNICATION EFFICIENCY

27 Sept 2024 (modified: 21 Nov 2024) · ICLR 2025 Conference Withdrawn Submission · CC BY 4.0
Keywords: Compositional optimization, Federated learning, Federated averaging, Distributed learning
Abstract: Compositional optimization (CO) has recently gained popularity due to its applications in distributionally robust optimization (DRO), meta-learning, reinforcement learning, and many other machine learning problems. The large-scale and distributed nature of data necessitates efficient federated learning (FL) algorithms for CO, but the compositional structure of the objective poses significant challenges. Current methods either rely on large batch gradients (which are impractical) or suffer from suboptimal communication efficiency. To address these challenges, we propose efficient FedAvg-type algorithms for solving non-convex CO in the FL setting. We first establish that standard FedAvg fails to solve federated CO problems because data heterogeneity amplifies the bias in local gradient estimates. Our analysis shows that either {\em additional communication} or {\em two-sided learning rates} are required to control this bias. To this end, we develop two algorithms for the federated CO problem. First, we propose FedDRO, which exploits the compositional problem structure to design a communication strategy that allows FedAvg to control the bias in the estimation of the compositional gradient, achieving $\mathcal{O}(\epsilon^{-2})$ sample and $\mathcal{O}(\epsilon^{-3/2})$ communication complexity. Then we propose DS-FedDRO, a two-sided learning rate algorithm that eliminates the need for additional communication and achieves the optimal $\mathcal{O}(\epsilon^{-2})$ sample and $\mathcal{O}(\epsilon^{-1})$ communication complexity, highlighting the importance of two-sided learning rates for federated CO. Both algorithms avoid large batch gradients and achieve linear speedup in the number of clients. We corroborate our theoretical findings with empirical studies on large-scale DRO problems.
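To make the "two-sided learning rate" idea mentioned in the abstract concrete, the sketch below shows a generic FedAvg-style round that uses a client step size for local updates and a separate server step size when aggregating them. This is only an illustrative sketch of the general concept, not the paper's FedDRO or DS-FedDRO algorithms; all names (`local_grad`, `clients`, `num_local_steps`, `eta_l`, `eta_g`) are hypothetical placeholders.

```python
# Illustrative sketch (assumed, not the paper's method): one FedAvg-style
# communication round with two-sided learning rates -- a client step size
# eta_l for local stochastic gradient steps and a server step size eta_g
# applied to the averaged client update.
import numpy as np

def two_sided_fedavg_round(x_global, clients, local_grad,
                           eta_l=0.01, eta_g=1.0, num_local_steps=10):
    """One round of FedAvg with separate client/server learning rates."""
    deltas = []
    for c in clients:
        x = x_global.copy()
        for _ in range(num_local_steps):
            x -= eta_l * local_grad(c, x)   # local stochastic gradient step
        deltas.append(x - x_global)         # client's total model update
    avg_delta = np.mean(deltas, axis=0)     # server averages the client updates
    return x_global + eta_g * avg_delta     # server applies its own step size
```

Decoupling the two step sizes lets the server damp or amplify the averaged update independently of how aggressively clients move locally, which is the lever the abstract credits with controlling the heterogeneity-induced bias without extra communication.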
Supplementary Material: pdf
Primary Area: optimization
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 12385