Federated Learning, Lessons from Generalization Study: Communicate Less, Learn More

19 Sept 2023 (modified: 11 Feb 2024) · Submitted to ICLR 2024
Primary Area: learning theory
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: Federated Learning, Generalization Error, SGD, PAC-Bayes, Rate-Distortion Theoretic bounds, Support Vector Machines
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
TL;DR: We study the effect of the number of communication rounds in federated learning on the generalization error.
Abstract: We investigate the generalization error of statistical learning models in a Federated Learning (FL) setting. Specifically, we study the evolution of the generalization error with the number of communication rounds between the clients and the parameter server, i.e., the effect on the generalization error of how often the local models computed by the clients are aggregated at the parameter server. We establish PAC-Bayes and rate-distortion theoretic bounds on the generalization error that account explicitly for the effect of the number of rounds, say $R \in \mathbb{N}^*$, in addition to the number of participating devices $K$ and the individual dataset size $n$. The bounds, which apply in their generality to a large class of loss functions and learning algorithms, appear to be the first of their kind for the FL setting. Furthermore, we apply our bounds to FL-type Support Vector Machines (FSVM) and derive (more) explicit bounds on the generalization error in this case. In particular, we show that the generalization bound of FSVM increases with $R$, suggesting that more frequent communication with the parameter server diminishes the generalization power of such learning algorithms. Combined with the fact that the empirical risk generally decreases for larger values of $R$, this indicates that $R$ might be a parameter to optimize in order to minimize the population risk of FSVM. Moreover, our bound suggests that for any $R$, the generalization error of the FSVM setting decreases faster than that of centralized learning by a factor of $\mathcal{O}(\sqrt{\log(K)/K})$, thereby generalizing recent findings in this direction for $R=1$ (sometimes referred to as ``one-shot'' FL or distributed learning) to any arbitrary number of rounds. We also provide experimental results obtained using neural networks (ResNet-56), which suggest that our observations for FSVM may hold true more generally.
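
For readers unfamiliar with the setting sketched in the abstract, the snippet below is a minimal, hedged illustration of an FL pipeline with $K$ clients, per-client dataset size $n$, and $R$ communication rounds, using FedAvg-style averaging at the parameter server. All function names (`local_sgd`, `fed_avg`), the linear model, and the squared loss are illustrative assumptions and are not the paper's actual algorithm or experimental setup.

```python
# Illustrative sketch only: K clients each hold a local dataset of size n,
# run local SGD, and the parameter server averages the local models once per
# round, for R rounds. This is a generic FedAvg-style loop, not the paper's method.
import numpy as np

def local_sgd(w, X, y, lr=0.1, epochs=1):
    """One client's local update: plain SGD on a squared loss (assumption)."""
    w = w.copy()
    for _ in range(epochs):
        for i in np.random.permutation(len(y)):
            grad = (X[i] @ w - y[i]) * X[i]
            w -= lr * grad
    return w

def fed_avg(client_data, R=10, d=5):
    """R communication rounds: broadcast, local training, server-side averaging."""
    w_global = np.zeros(d)
    for _ in range(R):
        local_models = [local_sgd(w_global, X, y) for X, y in client_data]
        w_global = np.mean(local_models, axis=0)  # aggregation at the parameter server
    return w_global

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    K, n, d = 10, 50, 5            # number of clients, per-client dataset size, dimension
    w_true = rng.normal(size=d)
    clients = []
    for _ in range(K):
        X = rng.normal(size=(n, d))
        clients.append((X, X @ w_true + 0.1 * rng.normal(size=n)))
    w_hat = fed_avg(clients, R=10, d=d)
    print("Parameter error:", np.linalg.norm(w_hat - w_true))
```

In this toy loop, increasing `R` typically drives the empirical risk down, while the paper's bounds suggest the generalization error grows with `R`, which is why `R` may be a quantity to tune rather than maximize.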
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
Supplementary Material: zip
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 1745