Overcoming Resource Constraints in Federated Learning: Large Models Can Be Trained with only Weak Clients
Abstract: Federated Learning (FL) is emerging as a popular, promising decentralized learning framework that enables collaborative training among clients without sharing private data among them or with a centralized server. However, because many edge clients lack sufficient computing, memory, or communication capabilities, federated learning of large models still faces significant bottlenecks. To keep such weak but crucial clients in the loop, prior works either consider a heterogeneous-client setting, in which clients train models of different sizes, or offload training to the server. However, the heterogeneous-client setting requires some clients to train the full model, which is not aligned with the resource-constrained setting, while offloading breaks the privacy promises of FL by sharing intermediate representations or labels with the server. To overcome these limitations, we formulate a realistic but much less explored cross-device FL setting in which no client can train a full large model and no client is willing to share any intermediate information with the remote server. Under this formulation, we develop a principal sub-model (PriSM) training methodology to collaboratively train a full large model while assigning each client a small sub-model that is a probabilistic low-rank approximation of the full server model. When creating sub-models, PriSM first performs a principal kernel analysis in the orthogonal kernel space to obtain the importance of each kernel. Then, PriSM adopts a novel importance-aware sampling process to select a subset of kernels (i.e., a kernel with higher importance is assigned a higher sampling probability). This sampling process ensures that each sub-model is still a low-rank approximation of the full model, while all sub-models together achieve nearly full coverage of the principal kernels.
To further improve memory efficiency while preserving accuracy, PriSM also exploits the low-rank structure in intermediate representations and allows each sub-model to learn only a subset of them. Our evaluations on various datasets and models (CNNs, LSTMs, Transformers) under different resource-constrained settings demonstrate that PriSM yields an accuracy improvement of up to $10\%$ compared to existing works. More importantly, PriSM does not incur significant accuracy degradation compared to full-model training (e.g., only a $\sim 2\%$ accuracy drop for ResNet-18/CIFAR-10 when clients train only $0.2\times$ sub-models).
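The importance-aware sampling idea described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes a layer's weights have already been reshaped into a 2-D matrix, uses a singular value decomposition as a stand-in for the principal kernel analysis, and assumes a sampling probability proportional to each singular value; the abstract only states that more important kernels are sampled with higher probability. The function name `sample_principal_submodel` and its parameters are hypothetical.

```python
import numpy as np


def sample_principal_submodel(weight, keep_ratio, rng):
    """Sample a low-rank sub-model from a 2-D weight matrix (out x in).

    Illustrative only: SVD stands in for the principal kernel analysis,
    and the proportional sampling rule is an assumption.
    """
    # Orthogonal decomposition: each (u_i, s_i, v_i) triple acts as one
    # principal kernel, with the singular value s_i as its importance.
    u, s, vt = np.linalg.svd(weight, full_matrices=False)

    # Number of principal kernels this client can afford to train.
    k = max(1, int(round(keep_ratio * s.size)))

    # Assumed rule: sampling probability proportional to importance.
    probs = s / s.sum()

    # Sample k distinct kernels; sorting keeps the factor order stable.
    idx = np.sort(rng.choice(s.size, size=k, replace=False, p=probs))
    return u[:, idx], s[idx], vt[idx, :], idx


rng = np.random.default_rng(0)
weight = rng.standard_normal((64, 32))

# A 0.2x sub-model keeps 6 of the 32 principal kernels of this layer.
u_k, s_k, vt_k, idx = sample_principal_submodel(weight, 0.2, rng)

# The sub-model's effective weight is a rank-k approximation of `weight`.
approx = (u_k * s_k) @ vt_k
```

Because each client draws its own random subset, different clients tend to receive different kernels, which is how the sub-models can jointly cover most principal kernels even though each one is small.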
License: Creative Commons Attribution 4.0 International (CC BY 4.0)
Submission Length: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Virginia_Smith1
Submission Number: 1262