LAVA: Lifetime-Aware VM Allocation with Learned Distributions and Adaptation to Mispredictions

Published: 11 Feb 2025, Last Modified: 13 May 2025MLSys 2025EveryoneRevisionsBibTeXCC BY 4.0
Keywords: ml for systems, cloud computing, vm scheduling
TL;DR: Prior work improved VM scheduling in data centers with predicted VM lifetimes. We further improve lifetime-aware scheduling by repredicting lifetimes and correcting mispredictions. We also present novel approaches for defragmentation and maintenance.
Abstract: Scheduling virtual machines (VMs) on hosts in cloud data centers dictates efficiency and is an NP-hard problem with incomplete information. Prior work improved VM scheduling with predicted VM lifetimes. Our work further improves lifetime-aware scheduling using repredictions with lifetime distributions versus one-shot prediction. Our approach repredicts and adjusts VM and host lifetimes when incorrect predictions emerge. We also present novel approaches for defragmentation and regular system maintenance, which are essential to our data center reliability and optimizations, and are not explored in prior work. We show repredictions deliver a fundamental advance in effectiveness over one-shot prediction. We call our novel combination of distribution-based lifetime predictions and scheduling algorithms Lifetime Aware VM Allocation (LAVA). LAVA reduces resource stranding and increases the number of empty hosts, which are critical for large VM scheduling, cloud system updates, and reducing dynamic energy consumption. Our approach runs in production within Google's hyperscale cloud data centers, where it improves efficiency by decreasing stranded compute and memory resources by ~3% and ~2% respectively. It increases empty hosts by 2.3-9.2 pp in production, reducing dynamic energy consumption, and increasing availability for large VMs and cloud system updates. We also show a reduction in VM migrations for host defragmentation and maintenance. In addition to our fleet-wide production deployment, we perform simulation studies to characterize the design space and show that our algorithm significantly outperforms the prior state of the art lifetime-based scheduling approach.
Supplementary Material: pdf
Submission Number: 227
Loading