Faster and Scalable MPI Applications Launching

Published: 01 Jan 2024 · Last Modified: 17 Jul 2025 · IEEE Trans. Parallel Distributed Syst. 2024 · CC BY-SA 4.0
Abstract: Distributed parallel MPI applications are the dominant workload in many high-performance computing systems. While optimizing MPI application execution is a well-studied field, little work has considered optimizing the initial MPI application launching phase, which incurs extensive cross-machine communication and synchronization. The overhead of MPI application launching can be expensive, accounting for more than a million core hours per 10K nodes annually on the production Tianhe-2A supercomputer, and this cost will grow as the number of parallel machines increases. It is therefore critical to optimize the MPI application launching process. This paper presents a novel approach to optimizing MPI application launching. Our approach adopts a location-aware address generation rule to eliminate the need for address exchange and a topology-aware global communication scheme to optimize cross-machine synchronization. We then design a new application launching procedure that supports the proposed optimizations and further reduces the pressure on the shared I/O system. Our techniques have been deployed in production on the Tianhe-2A supercomputer and the Next Generation Tianhe Supercomputer. Experimental results show that our approach scales well and outperforms alternative schemes, reducing the MPI application launching time by over 29% with 320K MPI processes.
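To illustrate the idea behind a location-aware address generation rule, the minimal C sketch below shows how every process could derive a peer's communication endpoint purely from that peer's global rank and a fixed launch layout, so the usual all-to-all address exchange during startup becomes unnecessary. This is not the paper's implementation; the struct fields, the processes-per-node constant, and the base-port scheme are illustrative assumptions.

```c
/* Sketch only: deterministic, location-aware endpoint derivation.
 * Assumes a fixed number of processes per node and a fixed base port;
 * real launchers would derive these from the actual allocation/topology. */
#include <stdint.h>
#include <stdio.h>

typedef struct {
    uint32_t node_id;     /* machine hosting the process                */
    uint32_t local_rank;  /* index of the process within that machine   */
    uint16_t port;        /* listening port derived from the local rank */
} endpoint_t;

enum { PROCS_PER_NODE = 32, BASE_PORT = 20000 };  /* assumed layout */

/* Every process applies the same rule, so any peer's address can be
 * computed locally instead of being exchanged at launch time. */
static endpoint_t addr_from_rank(uint32_t global_rank)
{
    endpoint_t ep;
    ep.node_id    = global_rank / PROCS_PER_NODE;
    ep.local_rank = global_rank % PROCS_PER_NODE;
    ep.port       = (uint16_t)(BASE_PORT + ep.local_rank);
    return ep;
}

int main(void)
{
    endpoint_t ep = addr_from_rank(70);  /* e.g., peer with global rank 70 */
    printf("node %u, local rank %u, port %u\n",
           ep.node_id, ep.local_rank, ep.port);
    return 0;
}
```

Because the rule is a pure function of rank and layout, the cost of "learning" all peer addresses is O(1) per peer and independent of the machine count, which is what removes the launch-time exchange the abstract refers to.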