MIST: Towards MPI Instant Startup and Termination on Tianhe HPC Systems

Yiqin Dai, Ruibo Wang, Yong Dong, Min Xie, Juan Chen, Wenzhe Zhang, Huijun Wu, Mingtian Shao, Kai Lu

Published: 2025, Last Modified: 18 Mar 2026. IEEE Trans. Parallel Distributed Syst. 2025. CC BY-SA 4.0
Abstract: As MPI programs grow in size with expanding HPC resources and parallelism demands, the overhead of MPI startup and termination escalates because both phases include poorly scalable global operations. These global operations, which involve extensive cross-machine communication and synchronization, are crucial for ensuring semantic correctness. Current work focuses on optimizing and accelerating these global operations rather than removing them, since removal requires systematic changes to the system software stack and may affect program semantics. Against this background, we propose MIST, a systematic solution that safely eliminates global operations in MPI startup and termination. By optimizing the generation of communication addresses, designing reliable communication protocols, and exploiting the resource release mechanism, MIST eliminates all global operations to achieve instant MPI startup and termination while ensuring correct program execution. Experiments on the Tianhe-2A supercomputer demonstrate that MIST can reduce MPI_Init() time by 32.5-77.6% and MPI_Finalize() time by 28.9-85.0%.