Abstract: As scientific research grows more complex, a wide range of HPC-AI workflows have been proposed, in which machine learning models learn from interactions with ensemble simulations on HPC clusters. However, existing distributed frameworks and solutions fall short in supporting this tight integration, struggling with challenges such as non-trivial application migration, heterogeneous task management, and inefficient I/O operations. To address these challenges, we present RTAI, a distributed runtime system aimed at optimizing emerging HPC-AI workflows. Specifically, it consists of several key components: RTAI Packer is a flexible abstraction that parallelizes the workflow and constructs a dynamic dependency graph of its tasks. RTAI Orchestrator improves workflow efficiency and cluster-wide resource utilization by adaptively organizing resources and managing heterogeneous tasks. RTAI FileHub is a tailored ad hoc file system that caches dynamically generated files in memory to enhance I/O efficiency. Our experiments show that RTAI significantly improves makespan, cluster utilization, scalability, and usability for HPC-AI applications compared with other candidate solutions such as Ray and Radical-Pilot.