Linear Convergence of Decentralized FedAvg for Non-Convex Objectives: The Interpolation Regime

Published: 01 Feb 2023, Last Modified: 13 Feb 2023. Submitted to ICLR 2023.
Keywords: Polyak-Lojasiewicz (PL) inequality, Federated Averaging, Linear convergence
TL;DR: Our work shows linear convergence for the Federated Averaging algorithm in the {\em Server} and {\em Decentralized} settings.
Abstract: In the age of Big Data, Federated Learning (FL) provides machine learning (ML) practitioners with an indispensable tool for solving large-scale learning problems. FL is a distributed optimization paradigm in which multiple nodes, each with access to a local dataset, collaborate (with or without a server) to solve a joint problem. Federated Averaging (FedAvg), although the algorithm of choice for many FL applications, is not well understood, especially in the interpolation regime, a common phenomenon observed in modern overparameterized neural networks. In this work, we address this challenge and perform a thorough theoretical performance analysis of FedAvg in the interpolation regime for training overparameterized neural networks. Specifically, we analyze the performance of FedAvg in two settings: (i) {\em [Server]}: the network has access to a server that coordinates information sharing among the nodes, and (ii) {\em [Decentralized]}: the serverless setting, where the local nodes communicate over an undirected graph. We consider a class of non-convex functions satisfying the Polyak-Lojasiewicz (PL) condition, a condition that is satisfied by overparameterized neural networks. For the first time, we establish that FedAvg in both the {\em Server} and {\em Decentralized} settings achieves linear convergence rates of $\mathcal{O}(T^{3/2} \log(1/\epsilon))$ and $\mathcal{O}(T^{2} \log(1/\epsilon))$, respectively, where $\epsilon$ is the desired solution accuracy and $T$ is the number of local updates at each node. In contrast to standard FedAvg analyses, our work does not require bounded heterogeneity, bounded variance, or bounded gradient assumptions. Instead, we show that sample-wise (and local) smoothness of the local loss functions suffices to capture the effect of heterogeneity in FL training. We use a novel application of induction to prove linear convergence in the {\em Decentralized} setting, which may be of independent interest. Finally, we conduct experiments on multiple real datasets to corroborate our theoretical findings.
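
For concreteness, the sketch below illustrates the kind of {\em Server}-setting FedAvg loop described in the abstract: each node runs $T$ sample-wise local SGD steps on its own loss and the server averages the resulting models. Under the PL condition in its standard form, $\frac{1}{2}\|\nabla f(w)\|^2 \ge \mu\,(f(w) - f^\star)$ for some $\mu > 0$, such schemes can converge linearly. This is a minimal sketch under assumed notation (function names local_sgd and fedavg_server, step size eta, per-node data as lists of (x, y) pairs), not the paper's exact protocol.

import numpy as np

def local_sgd(w, data, grad_fn, T, eta):
    """Run T sample-wise local SGD steps starting from the shared model w."""
    w = w.copy()
    for _ in range(T):
        x, y = data[np.random.randint(len(data))]  # draw one local sample
        w -= eta * grad_fn(w, x, y)                # sample-wise gradient step
    return w

def fedavg_server(w0, node_data, grad_fn, rounds, T, eta):
    """Server-coordinated FedAvg: broadcast, run local updates, then average."""
    w = w0
    for _ in range(rounds):
        local_models = [local_sgd(w, d, grad_fn, T, eta) for d in node_data]
        w = np.mean(local_models, axis=0)          # server aggregates by simple averaging
    return w

In the {\em Decentralized} setting, the server-side average is replaced by each node averaging its model with those of its neighbors on the undirected communication graph (e.g., via a mixing matrix), as described in the abstract.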
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics
Submission Guidelines: Yes
Please Choose The Closest Area That Your Submission Falls Into: Optimization (eg, convex and non-convex optimization)
Supplementary Material: zip