Cloud-Bursting and Autoscaling for Python-Native Scientific Workflows Using Ray

Published: 01 Jan 2023, Last Modified: 15 Feb 2025
Venue: ISC Workshops 2023
License: CC BY-SA 4.0
Abstract: We have extended the Ray framework to enable automatic scaling of workloads on high-performance computing (HPC) clusters managed by SLURM© and bursting to a Cloud managed by Kubernetes®. Compared to existing HPC-Cloud convergence solutions, our framework offers several advantages: users can provide their own Cloud resources, a Python-level abstraction removes the need for users to interact with job submission systems, and a single Python-based parallel workload can run concurrently across an HPC cluster and a Cloud. Applications in Electronic Design Automation are used to demonstrate how the solution scales a workload on an on-premises HPC system and automatically bursts to a public Cloud when the allocated HPC resources run out. The paper describes the initial implementation, demonstrates the novel functionality of the proposed framework, and identifies practical considerations and limitations of the Cloud bursting mode. The code of our framework is open-sourced.
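
The bursting behavior described in the abstract (fill the allocated HPC resources first, overflow to the Cloud once they run out) can be illustrated with a toy placement policy. This is only a minimal sketch under stated assumptions: `Pool`, `place_tasks`, and the CPU counts are hypothetical names and values invented for illustration, and the code uses neither Ray, SLURM, nor Kubernetes APIs. It is not the authors' implementation.

```python
from dataclasses import dataclass


@dataclass
class Pool:
    """A hypothetical resource pool (not a Ray, SLURM, or Kubernetes object)."""
    name: str
    free_cpus: int


def place_tasks(task_cpus, hpc, cloud):
    """Assign each task to HPC while CPUs remain, otherwise burst to Cloud.

    Returns a list of (task_index, pool_name) pairs; pool_name is None when
    neither pool currently has enough free CPUs for the task.
    """
    placement = []
    for i, cpus in enumerate(task_cpus):
        for pool in (hpc, cloud):  # HPC is tried first, Cloud only as overflow
            if pool.free_cpus >= cpus:
                pool.free_cpus -= cpus
                placement.append((i, pool.name))
                break
        else:
            placement.append((i, None))  # task stays pending
    return placement


if __name__ == "__main__":
    hpc = Pool("hpc", free_cpus=4)      # assumed HPC allocation
    cloud = Pool("cloud", free_cpus=8)  # assumed Cloud capacity
    print(place_tasks([2, 2, 2, 4, 4], hpc, cloud))
```

In this sketch the first two tasks exhaust the HPC allocation, the next two overflow to the Cloud pool, and the last one remains pending because neither pool has four free CPUs left. A real autoscaler would also add and remove nodes over time, which this example deliberately leaves out.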