Combining Checkpoint/Restart and Replication for Fault Tolerance with High Performance

Sarthak Joshi, Sathish Vadhiyar

Published: 2024, Last Modified: 13 May 2025. Venue: HIPCW 2024. License: CC BY-SA 4.0

Abstract: With the arrival of exascale computing, faults in high-performance systems are expected to increase considerably. To compensate for a higher failure rate, the standard checkpoint/restart technique would need to create checkpoints at a much higher frequency, incurring overhead that would not be sustainable for many scientific applications. To improve application efficiency in such high-failure environments, replication of MPI processes was proposed. Replication allows fast recovery from failures by simply dropping the failed processes and using their replicas to continue the regular operation of the application. We have implemented FTHP-MPI (Fault Tolerance and High-Performance MPI), a novel fault-tolerant MPI library that augments checkpoint/restart with replication to provide resilience to failures. The novelty of our work is that it is designed to provide fault tolerance in a native MPI library. This lets application developers achieve fault tolerance at high failure rates while also using the efficient communication protocols of native MPI libraries, which are generally fine-tuned for specific HPC platforms. Our work employs various concepts, including MPI-agnostic checkpointing from DMTCP and communicator shrinking from ULFM, and implements them together without requiring any code modifications on the user end. We have also implemented efficient parallel communication techniques that involve replicas.
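
The claim that checkpoint overhead grows with the failure rate can be illustrated with Young's classic first-order approximation for the optimal checkpoint interval, tau = sqrt(2 * C * M), where C is the time to write one checkpoint and M is the mean time between failures. The sketch below is not taken from the paper and is not the model FTHP-MPI uses; the function names and the sample values (a 60-second checkpoint, an MTBF of 24 hours versus 1 hour) are assumptions chosen only for illustration, and the approximation is meaningful only when C is much smaller than M.

```python
import math


def young_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young's first-order optimal checkpoint interval: sqrt(2 * C * M)."""
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


def overhead_fraction(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Approximate fraction of run time lost at the optimal interval.

    First-order model: C / tau is the time spent writing checkpoints,
    tau / (2 * M) is the expected rework after a failure.
    """
    tau = young_interval(checkpoint_cost_s, mtbf_s)
    return checkpoint_cost_s / tau + tau / (2.0 * mtbf_s)


if __name__ == "__main__":
    checkpoint_cost_s = 60.0  # assumed checkpoint write time (illustrative)
    for mtbf_hours in (24.0, 1.0):  # assumed MTBF values (illustrative)
        mtbf_s = mtbf_hours * 3600.0
        tau = young_interval(checkpoint_cost_s, mtbf_s)
        lost = overhead_fraction(checkpoint_cost_s, mtbf_s)
        print(f"MTBF {mtbf_hours:5.1f} h: checkpoint every {tau:8.1f} s, "
              f"approx. {100.0 * lost:5.2f}% of time lost")
```

Under this model the lost fraction at the optimal interval simplifies to sqrt(2 * C / M), so a 24-fold drop in MTBF raises the overhead by roughly a factor of sqrt(24), about 4.9. This is the kind of growth that motivates pairing checkpoint/restart with replication.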