Efficient multi-job federated learning scheduling with fault tolerance

Boqian Fu, Fahao Chen, Shengli Pan, Peng Li, Zhou Su

Published: 2025, Last Modified: 25 Jan 2026, Venue: Peer Peer Netw. Appl. 2025, License: CC BY-SA 4.0

Abstract: Federated Learning (FL) has emerged as a promising approach for learning from data distributed across edge devices. However, existing work focuses mainly on single-job FL systems, whereas in practice multiple FL jobs are often submitted simultaneously. How these jobs are scheduled is crucial for client resource utilization and job efficiency. In addition, existing work assumes that clients remain available throughout an FL job, which often does not hold in practice because clients can become unavailable for various reasons. To address these challenges, we introduce a novel fault-tolerant multi-job scheduling strategy aimed at optimizing job efficiency and resource utilization. The core of our approach is a redundancy-based fault tolerance mechanism that strategically selects clients for redundant model training, keeping FL jobs robust even when too few clients are available. Building on this mechanism, the scheduling algorithm prioritizes urgent FL jobs so that they can complete without waiting for long periods until additional clients become available. Extensive experiments demonstrate the effectiveness of the proposed method, which can significantly outperform the baseline methods.
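
The abstract gives no algorithmic detail, so the following is only a minimal sketch of the two ideas it names: urgent jobs are served first, and each job receives extra clients that train redundantly. It is not the authors' algorithm. The names `Job`, `Client`, `schedule`, `round_succeeds`, the `deadline` and `reliability` fields, and the greedy selection rule are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    deadline: float      # hypothetical urgency measure: smaller = more urgent
    clients_needed: int  # client updates required per training round


@dataclass
class Client:
    name: str
    reliability: float   # assumed probability of staying available for a round


def schedule(jobs, clients, redundancy=1):
    """Greedily assign clients to jobs, most urgent job first.

    Each scheduled job gets `clients_needed` primary clients plus up to
    `redundancy` backup clients that train the same model redundantly.
    Returns {job name: {"primary": [...], "redundant": [...]}}.
    """
    # Hand out the most reliable clients first (an illustrative choice).
    pool = sorted(clients, key=lambda c: c.reliability, reverse=True)
    plan = {}
    for job in sorted(jobs, key=lambda j: j.deadline):
        if len(pool) < job.clients_needed:
            continue  # too few clients left; this job waits for a later round
        primary = pool[:job.clients_needed]
        backups = pool[job.clients_needed:job.clients_needed + redundancy]
        pool = pool[job.clients_needed + len(backups):]
        plan[job.name] = {
            "primary": [c.name for c in primary],
            "redundant": [c.name for c in backups],
        }
    return plan


def round_succeeds(assignment, failed):
    """A round survives if enough assigned clients (primary or backup) remain."""
    needed = len(assignment["primary"])
    assigned = assignment["primary"] + assignment["redundant"]
    alive = [name for name in assigned if name not in failed]
    return len(alive) >= needed
```

With one backup per job, a job that needs two clients tolerates the loss of any one assigned client but not two. Under these assumptions, raising `redundancy` trades client capacity available to other jobs for robustness of the scheduled ones.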

External IDs: dblp:journals/ppna/FuCPLS25