Exploring the Trade-Off Between Repair Time and Reliability in Large Scale Cluster Computers: A Simulation-Based Approach

Published: 2024, Last Modified: 25 Jan 2026. Venue: HPEC 2024. License: CC BY-SA 4.0
Abstract: As high performance computing (HPC) clusters continue to grow in performance, scale, and component count, reliability, and repair time in particular, plays a significant role in the specification, procurement, and ultimate operation of such systems. System administrators must balance competing factors, chief among them initial capital investment, operational costs, and the observed system performance and utility from the end users' perspective. In this paper, we explore the tradeoff between reliability, performance, and node repair times in large-scale HPC clusters using real historical workloads from Los Alamos National Laboratory (LANL). We enhance an existing cluster simulator so that it performs the large-scale parameter sweeps necessary to obtain meaningful results for these studies more quickly, in some cases by several orders of magnitude. Our results show that these simulations can be parameterized to identify trends that can inform decisions about system procurement and operation as a function of the operational parameters and constraints.