SHARP: A Distribution-Based Framework for Reproducible Performance Evaluation

Published: 01 Jan 2024, Last Modified: 20 May 2025, IISWC 2024, CC BY-SA 4.0
Abstract: Performance evaluation studies often produce unreliable or irreproducible results because (a) measurements exhibit high variability arising from many system variables and diverse operational conditions, and (b) reported results often focus on point-summary statistics, such as average values. Despite recent efforts, there is no general framework for assessing and comparing the performance of high-performance systems in a principled and reproducible way. This paper addresses this critical gap by introducing Sharp, an open-source framework designed to redefine performance evaluation with a reproducibility-first approach. Sharp enables a comprehensive characterization of an application's performance distribution while orchestrating experiments efficiently. Sharp addresses these key challenges through (a) robust performance analysis and comparison with Similarity Metrics; (b) automatic determination of a reliable sample size via a diverse set of Stopping Rules; and (c) comprehensive recording of experimental conditions and results. We showcase the need for and the advantages of Sharp by evaluating the performance of 20 Rodinia benchmarks on 3 HPC servers with different CPU and GPU configurations. We empirically evaluate Sharp to expose the need for distribution-based statistics and demonstrate how Sharp's stopping rules attain reliable performance results while reducing resource usage by up to ∼90% relative to a large fixed number of experiments sufficient to establish a ground truth. We see the Sharp framework as a fundamental step towards providing customers and engineers with a reproducible and reliable way to reason about and compare the performance of HPC applications and infrastructure.
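To make the idea of a stopping rule concrete, the sketch below shows one common approach: keep repeating a benchmark until the relative confidence-interval half-width of the mean falls below a tolerance. This is a minimal illustrative example, not Sharp's actual API; the names `run_until_stable` and `relative_ci_half_width` and all parameter defaults are hypothetical.

```python
import statistics

def relative_ci_half_width(samples, z=1.96):
    """Half-width of the ~95% normal-approximation CI, relative to the mean."""
    mean = statistics.mean(samples)
    sem = statistics.stdev(samples) / len(samples) ** 0.5
    return (z * sem) / mean

def run_until_stable(run_benchmark, min_runs=10, max_runs=500, tolerance=0.02):
    """Collect measurements until the relative CI half-width drops below
    `tolerance` (after at least `min_runs` runs), or `max_runs` is reached."""
    samples = []
    for _ in range(max_runs):
        samples.append(run_benchmark())
        if len(samples) >= min_runs and relative_ci_half_width(samples) < tolerance:
            break
    return samples

# Example usage with a synthetic, noisy "measurement".
if __name__ == "__main__":
    import random
    samples = run_until_stable(lambda: 1.0 + random.gauss(0, 0.05))
    print(f"{len(samples)} runs, mean = {statistics.mean(samples):.3f}")
```

A rule of this kind stops early when the measured distribution is tight, which is how a framework can cut the number of experiments substantially relative to a large fixed budget while still bounding the uncertainty of the reported estimate.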