SURFBoard: Reproducible Performance Analysis for Distributed Machine Learning Workflows

Anonymous

02 Mar 2021 (modified: 05 May 2023) · JSYS 2021 Mar Papers Blind Submission
Keywords: Reproducibility, Performance Analysis, Distributed Machine Learning, Workflows, Large-scale Infrastructure
TL;DR: SURFBoard is a container-based framework for reproducible performance analysis of ML workflows at scale.
Abstract: Large-scale HPC infrastructures enable scientific research in many domains. Recent advances in machine learning (ML) have led to an ever-increasing demand for computational power, as well as to the design of complex operational workflows. Understanding the performance and efficiency of these workflows is key to productivity, to knowledge and model sharing, and to energy efficiency. Despite efforts to study and design portability protocols, performance analysis of large-scale ML remains an expert-driven task, tightly locked in to specific physical and software infrastructure. As in other domains, this hinders the reproducibility of both results and overall workflow performance. To overcome this challenge, we propose the design of a container-based framework for reproducible performance analysis of ML workflows at scale. We validate our framework through a case study on two different large-scale production systems running ML workflows. We show empirically that our containerized approach is portable and enables arbitrarily low-level performance evaluation when run on two different production HPC clusters with hundreds of GPUs. We report our findings on widely used open-source software stacks and datasets and offer practitioners insights into the types of analyses our framework enables. To benefit the community, we open-source our software and results.
Area: Data Science and Reproducibility
Type: Tool/benchmark
Conflicts: All(Vrije Universiteit Amsterdam), All(University of Amsterdam), All(Leiden University)
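The authors open-source their software; details of SURFBoard itself live in that artifact. As a rough illustration of the kind of container-based, low-level measurement the abstract describes, the sketch below launches a containerized ML workload under Singularity/Apptainer with GPU access and samples per-GPU counters via nvidia-smi while it runs. The image name (surfboard.sif), the workload entry point (train.py), and the sampling period are hypothetical placeholders for illustration, not the authors' actual tooling.

```python
import csv
import subprocess
import time
from pathlib import Path

# Hypothetical names for illustration only; the real SURFBoard image and
# workload entry point are published with the authors' open-source artifact.
IMAGE = "surfboard.sif"            # assumed Singularity/Apptainer image
WORKLOAD = ["python", "train.py"]  # assumed ML training entry point
SAMPLE_PERIOD_S = 1.0
OUT = Path("gpu_metrics.csv")


def sample_gpus():
    """Query per-GPU utilization, memory, and power draw via nvidia-smi."""
    result = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=index,utilization.gpu,memory.used,power.draw",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True)
    return [row.split(", ") for row in result.stdout.strip().splitlines()]


def main():
    # Launch the workload inside the container; --nv exposes the host GPUs.
    proc = subprocess.Popen(["singularity", "exec", "--nv", IMAGE, *WORKLOAD])
    with OUT.open("w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["timestamp", "gpu", "util_pct", "mem_mib", "power_w"])
        # Sample until the containerized workload exits.
        while proc.poll() is None:
            ts = time.time()
            for gpu, util, mem, power in sample_gpus():
                writer.writerow([ts, gpu, util, mem, power])
            time.sleep(SAMPLE_PERIOD_S)
    proc.wait()


if __name__ == "__main__":
    main()
```

The design point this sketch tries to capture is the portability claim: the same script, with only the container image in common, can be run unchanged on two different HPC clusters, which is what makes cross-system performance comparisons reproducible.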