Track: long paper (up to 4 pages)
Keywords: Deep Learning, Cloud Computing, Resource Management, Reinforcement Learning, Virtual Machine Allocation, Workload Optimization, Model Deployment, Distribution Shift, Scalability, Interpretability, Computational Efficiency, Auto-Scaling, Scheduling, Cost Optimization, Adaptive Learning
TL;DR: Our DRL-based VM scheduler underperformed rule-based heuristics in a real-world multi-cloud deployment. We analyze the failures and offer recommendations for robust AI-driven cloud management.
Abstract: Deep learning has shown promise in optimizing cloud resource management by enabling dynamic workload scheduling, auto-scaling, and cost-efficient operations. However, our real-world deployment of a deep reinforcement learning (DRL)-based scheduler for virtual machine (VM) allocation and scaling in a multi-cloud environment revealed unexpected failures. Despite extensive training on historical workload data, the model underperformed rule-based heuristics because of distribution shifts, delayed feedback loops, and computational inefficiencies. This paper investigates the root causes of these failures, highlights key challenges in applying deep learning to cloud infrastructure, and provides actionable recommendations for improving robustness, scalability, and interpretability in real-world AI-driven cloud management systems.
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Submission Number: 8