MPDA: A Massively Parallel Learning and Dependency-Aware Scheduling Algorithm for Data Processing Clusters

Published: 2025 · Last Modified: 06 Jan 2026 · IEEE Trans. Serv. Comput. 2025 · CC BY-SA 4.0
Abstract: In the era of large-scale machine learning, clusters are extensively used for data processing jobs. However, state-of-the-art heuristic-based and Deep Reinforcement Learning (DRL) based job scheduling mechanisms face challenges such as slow training and under-exploitation of jobs' complex dependencies. We propose MPDA, a Massively Parallel learning and Dependency-Aware scheduling algorithm, consisting of a fast-training mechanism and a novel dependency-aware policy network, GATNetwork, to address these two challenges respectively. The fast-training mechanism is a two-level massively parallel training method that significantly accelerates training and fully utilizes the cluster's resources. Moreover, its decoupled learning-and-interacting design enables hybrid-workload training for MPDA, which ensures MPDA's generalization and robustness. GATNetwork exploits the dependencies among stages and jobs using Graph Attention Network (GAT) and Long Short-Term Memory (LSTM) networks to improve the scheduling policy. Experiments show that MPDA accelerates training by one to two orders of magnitude and achieves better scheduling performance, i.e., lower average job completion time, than existing scheduling algorithms.
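To make the dependency-aware idea concrete, the sketch below shows a single graph-attention layer aggregating per-stage embeddings along job DAG edges, in the spirit of what the abstract describes GATNetwork as doing. This is an illustrative NumPy toy, not the paper's implementation: the feature dimension, adjacency matrix, weight initialization, and single-head attention are all assumptions.

```python
import numpy as np

# Illustrative sketch (NOT the paper's GATNetwork): one single-head
# graph-attention layer over a toy 4-stage job DAG.
rng = np.random.default_rng(0)

n_stages, d = 4, 8
X = rng.standard_normal((n_stages, d))   # per-stage feature embeddings (toy)
A = np.array([[0, 1, 1, 0],              # A[i, j] = 1 means stage j depends
              [0, 0, 0, 1],              # on stage i (edge i -> j)
              [0, 0, 0, 1],
              [0, 0, 0, 0]])

W = rng.standard_normal((d, d))          # shared linear transform
a = rng.standard_normal(2 * d)           # attention vector

H = X @ W
# Each stage attends over its parents plus itself (self-loop).
mask = A.T + np.eye(n_stages)
logits = np.full((n_stages, n_stages), -np.inf)
for j in range(n_stages):
    for i in range(n_stages):
        if mask[j, i]:
            e = a @ np.concatenate([H[j], H[i]])
            logits[j, i] = np.maximum(0.2 * e, e)   # LeakyReLU, slope 0.2

# Row-wise softmax restricted to each stage's dependency neighborhood.
alpha = np.exp(logits - logits.max(axis=1, keepdims=True))
alpha /= alpha.sum(axis=1, keepdims=True)
H_out = alpha @ H    # dependency-aware stage embeddings, shape (4, 8)
```

In the full system such embeddings would then feed a downstream network (the paper pairs GAT with LSTMs) to score stages for scheduling; here `H_out` simply demonstrates how attention weights are confined to each stage's DAG parents.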