Performance Roulette: How Cloud Weather Affects ML-Based System Optimization

NeurIPS 2023 MLSys Workshop, Submission 24

Published: 28 Oct 2023, Last Modified: 12 Dec 2023
MLSys Workshop NeurIPS 2023 Poster
Keywords: Database, benchmarking, performance, tuning, reliability, stability, interference, cloud computing, optimization
TL;DR: We find that performance variability during the training of ML-based system optimizers degrades the transferability of learned optimizations, and we propose mitigations.
Abstract: As system complexity, workload diversity, and cloud computing adoption continue to grow, both operators and developers are turning to machine learning (ML) based approaches for optimizing systems. ML-based approaches typically measure the performance of candidate system configurations in order to discover the best one. However, it is widely recognized that cloud systems can be affected by "cloud weather", i.e., shifts in performance due to hardware heterogeneity, interference from co-located workloads, virtualization overheads, etc. Given these two trends, in this work we ask: how much can performance variability during training affect ML approaches applied to systems? Using DBMS knob configuration tuning as a case study, we present two measurement studies that show how ML-based optimizers can be affected by noise. This leads to four main observable problems: (1) there exist highly sensitive configurations whose performance does not transfer across machines of the same type; (2) unstable configurations during training significantly impact configuration transferability; (3) tuning in an environment with non-representative noise degrades final performance in the deployment environment; and (4) sampling noise slows convergence. Finally, we propose a set of methods to mitigate these measurement challenges when training ML-based system components.
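The core loop the abstract describes, measuring candidate configurations under noisy "cloud weather" and picking the apparent winner, can be illustrated with a minimal sketch. This is not the paper's method: the objective function, the knob range, and the median-of-k repeated-measurement mitigation are all hypothetical stand-ins chosen to show how a single noisy sample can mislead a tuner, while aggregating repeated measurements stabilizes the choice.

```python
import random

def measure(config, noise_sd):
    """Hypothetical benchmark: true throughput peaks at knob value 0.7;
    Gaussian noise stands in for interference, heterogeneity, etc."""
    true_perf = 100 - 200 * (config - 0.7) ** 2
    return true_perf + random.gauss(0, noise_sd)

def pick_best(configs, samples_per_config, noise_sd):
    """Return the config whose median measured performance is highest.

    samples_per_config=1 mimics a naive tuner that trusts one sample;
    a larger value is a simple repeated-measurement mitigation.
    """
    def score(c):
        obs = sorted(measure(c, noise_sd) for _ in range(samples_per_config))
        return obs[len(obs) // 2]  # median of the observed samples
    return max(configs, key=score)

if __name__ == "__main__":
    random.seed(0)
    configs = [i / 10 for i in range(11)]  # candidate knob settings
    # One noisy sample per config vs. median of nine samples per config:
    print("single-sample pick:", pick_best(configs, 1, noise_sd=30))
    print("median-of-9 pick:  ", pick_best(configs, 9, noise_sd=30))
```

With no noise the tuner always recovers the true optimum (0.7); as the noise standard deviation approaches the performance gap between configurations, the single-sample pick becomes effectively random, which is one intuition for why non-representative training noise and sampling noise hurt both transferability and convergence.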
Submission Number: 24