Incorporating Staggered Planned Maintenance Reservations to Improve Performance in Computational Clusters

Published: 01 Jan 2023, Last Modified: 26 Jul 2025, CLUSTER Workshops 2023, CC BY-SA 4.0
Abstract: Planned maintenance reservations (PMRs) are often employed, through advance reservation of cluster resources, as a dedicated and coordinated effort to provide a future time window in which needed updates, upgrades, replacements and other types of service can be performed on both hardware and software in large-scale computational clusters. Within the US DOE complex, scheduled security updates are one example where every node in the cluster must be updated in order to comply with governing regulations. As a result, whole-cluster PMRs are routinely scheduled at regular intervals to accomplish this and other needed tasks. While it may be convenient for system administrators to have access to the entire cluster at the same time to complete these updates, it is not necessarily ideal from the users' perspective, because there are times when the entire cluster is unavailable to execute any job. One alternative to whole-cluster PMRs is staggered PMRs, where only part of the cluster is offline at any one instant. In this paper, we use a modified version of a well-known simulation framework to conduct several parameter studies in which PMRs are staggered in a variety of ways, in order to determine the impact of staggered reservations not only on the overall cluster but also on specific classes of jobs. For the workload tested, staggered PMRs can achieve moderate improvements in queue waiting time across the entire machine, especially for smaller jobs, without any significant negative impact on any one class of jobs, even larger ones.
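To make the contrast between whole-cluster and staggered PMRs concrete, the following is a minimal Python sketch of how a staggered schedule might be laid out. It is not the simulation framework used in the paper. The names `Reservation`, `staggered_pmrs` and `offline_nodes`, the 100-node cluster, the 8-hour window and the back-to-back group layout are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reservation:
    start: int    # start time in hours
    end: int      # end time in hours (exclusive)
    nodes: range  # node IDs taken offline during [start, end)


def staggered_pmrs(total_nodes, groups, start, duration):
    """Split the nodes into `groups` contiguous blocks and reserve them
    back to back (assumed layout). groups=1 is a whole-cluster PMR."""
    base, extra = divmod(total_nodes, groups)
    reservations = []
    first = 0
    for g in range(groups):
        # Spread any remainder over the first `extra` groups.
        size = base + (1 if g < extra else 0)
        t0 = start + g * duration
        reservations.append(
            Reservation(t0, t0 + duration, range(first, first + size))
        )
        first += size
    return reservations


def offline_nodes(reservations, t):
    """Number of nodes under maintenance at time t."""
    return sum(len(r.nodes) for r in reservations if r.start <= t < r.end)


# Hypothetical 100-node cluster with an 8-hour maintenance window per group.
whole = staggered_pmrs(total_nodes=100, groups=1, start=0, duration=8)
staggered = staggered_pmrs(total_nodes=100, groups=4, start=0, duration=8)

peak_whole = max(offline_nodes(whole, t) for t in range(32))
peak_staggered = max(offline_nodes(staggered, t) for t in range(32))
print(peak_whole, peak_staggered)
```

Under these assumptions the whole-cluster PMR takes every node offline at once, while the four-way stagger never has more than a quarter of the nodes offline, at the cost of a maintenance period that spans four windows instead of one. Whether that trade improves queue waiting time depends on the workload, which is what the parameter studies in the paper examine.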