Scalable I/O aggregation for asynchronous multi-level checkpointing

Published: 01 Jan 2024, Last Modified: 19 Feb 2025
Venue: Future Gener. Comput. Syst. 2024
License: CC BY-SA 4.0
Abstract: Highlights
• Checkpointing is an increasingly frequent and needed operation of HPC applications.
• Asynchronous checkpointing frameworks overlap computations and I/O to mask latency.
• Such overlap results in applications and checkpointing frameworks sharing resources.
• Asynchronous checkpointing uses one-file-per-process writing to ease I/O bottlenecks.
• However, file-per-process writing is unsustainable for users and systems at scale.
• Aggregation is necessary to alleviate usability and performance bottlenecks.
• Yet, the impact of aggregation on asynchronous checkpointing is largely unexplored.
• We implement an optimized aggregation scheme designed for asynchronous checkpointing.