A Reinforcement Learning Approach for Minimizing Job Completion Time in Clustered Federated Learning

Ruiting Zhou; Jieling Yu; Ruobei Wang; Bo Li; Jiacheng Jiang; Libing Wu

Published: 01 Jan 2023, Last Modified: 15 May 2025, Venue: INFOCOM 2023, License: CC BY-SA 4.0

Abstract: Federated Learning (FL) enables a potentially large number of clients to collaboratively train a global model under the coordination of a central cloud server without exposing raw client data. However, FL model convergence performance, often measured by the job completion time, is hindered by two critical factors: non-independent and identically distributed (non-IID) data across clients and the straggler effect. In this work, we propose a clustered FL framework, MCFL, to minimize the job completion time by mitigating the influence of non-IID data and the straggler effect while guaranteeing FL model convergence performance. MCFL builds upon a two-stage operation: i) a clustering algorithm constructs clusters, each containing clients with similar computing and communication capabilities, to combat the straggler effect within a cluster; ii) a deep reinforcement learning (DRL) algorithm based on soft actor-critic with discrete actions selects a subset of clients from each cluster to mitigate the impact of non-IID data, and derives the number of intra-cluster aggregation iterations for each cluster to reduce the straggler effect among clusters. Extensive testbed experiments are conducted under various configurations to verify the efficacy of MCFL. The results show that MCFL can reduce the job completion time by up to 70% compared with three state-of-the-art FL frameworks.
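To make stage i) concrete, the following is a minimal sketch of grouping clients by capability. The abstract does not name the clustering algorithm, so plain k-means with a deterministic farthest-point initialization is used here purely as a stand-in. The client profiles (per-round compute time and upload time, in seconds) and the helper names `cluster_clients` and `straggler_gap` are hypothetical and not taken from the paper.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def farthest_point_init(points, k):
    """Deterministic seeding: start at the first point, then repeatedly
    add the point farthest from all centers chosen so far."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(
            max(points, key=lambda p: min(squared_distance(p, c) for c in centers))
        )
    return centers


def cluster_clients(profiles, k, iters=20):
    """Assumed stand-in for stage i): k-means over capability profiles."""
    centers = farthest_point_init(profiles, k)
    labels = [0] * len(profiles)
    for _ in range(iters):
        # Assignment step: each client joins its nearest center.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in profiles
        ]
        # Update step: each center moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(profiles, labels) if lab == c]
            if members:
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels


def straggler_gap(profiles):
    """Gap between the slowest and fastest client, where a client's
    round time is modeled as compute time plus upload time."""
    times = [sum(p) for p in profiles]
    return max(times) - min(times)


# Hypothetical profiles: (compute_time_s, upload_time_s) per round.
profiles = [(1.0, 0.5), (1.2, 0.6), (0.9, 0.4), (6.0, 3.0), (6.5, 2.8), (5.8, 3.2)]
labels = cluster_clients(profiles, k=2)

global_gap = straggler_gap(profiles)
cluster_gaps = [
    straggler_gap([p for p, lab in zip(profiles, labels) if lab == c])
    for c in sorted(set(labels))
]
print(labels, global_gap, cluster_gaps)
```

In this toy setting, the fast and slow clients land in separate clusters, so the wait for the slowest member inside each cluster is much shorter than it would be across all clients. Stage ii), the DRL-based client selection and choice of intra-cluster aggregation iterations, is not sketched here.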
