Keywords: Distributed Machine Learning, Federated Learning
Abstract: In Federated Learning (FL), client devices collaboratively train a model without sharing the private data held on the devices. Federated Averaging (FedAvg) generalises the Federated Stochastic Gradient Descent (FedSGD) algorithm by having clients perform multiple local steps of SGD per communication round. Recent works show that when client data is distributed heterogeneously, the loss function minimised by FedAvg differs from the 'true' loss that would be minimised by centralised training. Previous works propose decaying the client learning rate, $\gamma$, to allow FedAvg to minimise the true loss. We propose instead decaying the number of local SGD steps, $K$, that clients perform during training rounds. Decaying $K$ has the added benefit of reducing the total computation that clients perform during FedAvg. Real-world applications of FL involve large numbers of low-powered smartphone or Internet-of-Things clients, so reducing computation yields significant savings in energy and time. In this work, we prove for quadratic objectives that annealing $K$ allows FedAvg to approach the true minimiser. We then perform thorough experiments on three benchmark FL datasets, showing that decaying $K$ achieves the same generalisation performance as decaying $\gamma$, but with up to $3.8\times$ fewer total steps of SGD performed by clients.
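For concreteness, below is a minimal PyTorch sketch of one FedAvg communication round in which the number of local SGD steps $K$ is annealed over rounds. The exponential schedule `decayed_local_steps`, its parameters (`k_initial`, `decay`, `k_min`), the classification loss, and the plain parameter-wise averaging are illustrative assumptions for this sketch, not the exact schedule or implementation used in the paper.

```python
import copy

import torch


def decayed_local_steps(round_idx, k_initial=50, decay=0.95, k_min=1):
    """Hypothetical schedule: exponentially anneal the number of local steps K."""
    return max(k_min, int(k_initial * decay ** round_idx))


def fedavg_round(global_model, clients, round_idx, gamma=0.1):
    """One FedAvg round where each client runs K (decayed) local SGD steps."""
    k = decayed_local_steps(round_idx)
    client_states = []
    for loader in clients:  # each `loader` yields (inputs, targets) batches for one client
        local_model = copy.deepcopy(global_model)
        optimiser = torch.optim.SGD(local_model.parameters(), lr=gamma)
        data_iter = iter(loader)
        for _ in range(k):  # K local SGD steps on this client's private data
            try:
                inputs, targets = next(data_iter)
            except StopIteration:
                data_iter = iter(loader)
                inputs, targets = next(data_iter)
            optimiser.zero_grad()
            loss = torch.nn.functional.cross_entropy(local_model(inputs), targets)
            loss.backward()
            optimiser.step()
        client_states.append(local_model.state_dict())
    # Average the client models parameter-wise to form the new global model.
    averaged = {
        name: torch.stack([state[name].float() for state in client_states]).mean(dim=0)
        for name in client_states[0]
    }
    global_model.load_state_dict(averaged)
    return global_model
```

In this sketch `clients` would be a list of per-client `DataLoader`s; in a real federated deployment each inner loop runs on a separate device and only the resulting model states are communicated for averaging.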
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics
Community Implementations: [1 code implementation (CatalyzeX)](https://www.catalyzex.com/paper/arxiv:2305.09628/code)
Reviewed Version (pdf): https://openreview.net/references/pdf?id=YjKZyORuxRz