SURFBoard: Reproducible Performance Analysis for Distributed Machine Learning Workflows


02 Mar 2021, 15:23 (edited 16 Apr 2021) · JSYS 2021 Mar Papers Blind Submission
  • Keywords: Reproducibility, Performance Analysis, Distributed Machine Learning, Workflows, Large-scale Infrastructure
  • TL;DR: SURFBoard is a container-based framework for reproducible performance analysis of ML workflows at scale.
  • Abstract: Large-scale HPC infrastructures are enablers for scientific research in many domains. Recent advances in machine learning (ML) have led to an ever-increasing demand for computational power, as well as the design of complex operational workflows. Understanding the performance and efficiency of these workflows is key to productivity, knowledge and model sharing, and energy efficiency. Even though there have been efforts to study and design portability protocols, performance analysis of large-scale ML is still an expert-driven task, tightly locked into specific physical and software infrastructure. Much like in other domains, this hinders reproducibility of both results and overall workflow performance. To overcome this challenge, we propose the design of a container-based framework for reproducible performance analysis of ML workflows at scale. We validate our framework using a case study on two different large-scale production systems running ML workflows. We show empirically that our containerized approach is portable and enables arbitrarily low-level performance evaluation when run on two different production HPC clusters with hundreds of GPUs. We report our findings on widely used open-source software stacks and datasets and offer practitioners insights into the types of analyses our framework enables. To benefit the community, we open-source our software and results.
  • Area: Data Science and Reproducibility
  • Type: Tool/benchmark
  • Conflicts: All(Vrije Universiteit Amsterdam), All(University of Amsterdam), All(Leiden University)
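To illustrate the kind of cross-cluster performance comparison the abstract describes, the sketch below aggregates per-step wall-clock timings from two runs of the same containerized workload into comparable throughput statistics. All function names and numbers here are hypothetical illustrations, not SURFBoard's actual interface or measured results:

```python
# Hypothetical sketch: compare per-step training times collected on two
# clusters to check whether a containerized ML workload performs consistently.
# Names and sample values are illustrative only; SURFBoard's real API may differ.
from statistics import mean, stdev

def throughput_stats(step_times_s, samples_per_step):
    """Return (mean throughput in samples/s, coefficient of variation of step time)."""
    mu = mean(step_times_s)
    cv = stdev(step_times_s) / mu if len(step_times_s) > 1 else 0.0
    return samples_per_step / mu, cv

# Example per-step wall-clock times (seconds) from two runs of the same container.
cluster_a = [0.52, 0.51, 0.53, 0.52, 0.54]
cluster_b = [0.61, 0.60, 0.62, 0.63, 0.61]

tp_a, cv_a = throughput_stats(cluster_a, samples_per_step=256)
tp_b, cv_b = throughput_stats(cluster_b, samples_per_step=256)
print(f"cluster A: {tp_a:.1f} samples/s (cv={cv_a:.3f})")
print(f"cluster B: {tp_b:.1f} samples/s (cv={cv_b:.3f})")
```

A low coefficient of variation within each run, combined with comparable throughput across clusters, is one simple signal that the containerized workflow's performance is reproducible across sites.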