A New Theoretical Perspective on Data Heterogeneity in Federated Averaging

22 Sept 2023 (modified: 11 Feb 2024) · Submitted to ICLR 2024
Primary Area: optimization
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: Federated Learning, Data Heterogeneity, Theoretical Analysis
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
TL;DR: This paper bridges the gap between pessimistic theoretical results and strong empirical results for federated algorithms by introducing a new perspective on data heterogeneity.
Abstract: In federated learning, data heterogeneity is the main reason that existing theoretical analyses are pessimistic about the convergence error caused by local updates. However, empirical studies have shown that more local updates can improve the convergence rate and reduce the communication cost when data are heterogeneous. This paper aims to bridge this gap between theoretical understanding and practical performance by providing a theoretical analysis of federated averaging (FedAvg) with non-convex objective functions from a new perspective on data heterogeneity. Identifying the limitations of the commonly used bounded-gradient-divergence assumption, we propose a new assumption, termed the heterogeneity-driven Lipschitz assumption, which characterizes the fundamental effect of data heterogeneity on local updates. In the convergence analysis, we use the heterogeneity-driven Lipschitz constant and the global Lipschitz constant in place of the widely used local Lipschitz constant, and we show that our assumptions are weaker than those used in the literature. Based on the new assumption, we derive novel convergence bounds for both full participation and partial participation, which are tighter than the state-of-the-art bounds for FedAvg. These results also imply that more local updates can improve the convergence rate even when data are highly heterogeneous. Further, we discuss the insight behind the proposed heterogeneity-driven Lipschitz assumption and identify a regime in which FedAvg (also known as local SGD) can outperform mini-batch SGD even when the gradient divergence is arbitrarily large.
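For readers unfamiliar with the algorithm being analyzed, the following is a minimal NumPy sketch of FedAvg (local SGD) with multiple local updates per round and partial client participation. The quadratic local objectives, client-sampling scheme, and all names here are illustrative assumptions for exposition only, not the paper's implementation or experimental setup.

```python
# Minimal sketch of FedAvg (local SGD) with partial participation.
# The quadratic local objectives and hyperparameters are illustrative
# assumptions; they are not taken from the paper.
import numpy as np

rng = np.random.default_rng(0)

num_clients, dim = 20, 5
# Heterogeneous clients: client i minimizes f_i(w) = 0.5 * ||w - b_i||^2
# with distinct optima b_i, so local gradients diverge across clients.
b = rng.normal(size=(num_clients, dim))

def local_grad(w, i, batch_noise=0.1):
    """Stochastic gradient of client i's local objective."""
    return (w - b[i]) + batch_noise * rng.normal(size=dim)

def fedavg(rounds=50, local_steps=10, lr=0.1, participation=0.5):
    w = np.zeros(dim)
    for _ in range(rounds):
        # Partial participation: sample a subset of clients each round.
        m = max(1, int(participation * num_clients))
        sampled = rng.choice(num_clients, size=m, replace=False)
        local_models = []
        for i in sampled:
            w_i = w.copy()
            for _ in range(local_steps):       # multiple local SGD updates
                w_i -= lr * local_grad(w_i, i)
            local_models.append(w_i)
        w = np.mean(local_models, axis=0)      # server averages local models
    return w

w_final = fedavg()
# For these quadratics, the global optimum is the mean of the b_i.
print("distance to global optimum:", np.linalg.norm(w_final - b.mean(axis=0)))
```

Increasing `local_steps` in this toy setup reduces the number of communication rounds needed to approach the global optimum, which is the empirical behavior the abstract refers to.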
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
Supplementary Material: zip
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 4651