Keywords: Cloud Systems Management, Resource Management, Foundation Models, Reinforcement Learning, Machine Learning, Meta Learning
TL;DR: We demonstrated a foundation model design for cloud systems management and discussed the unique risks and challenges of developing and deploying foundation models for cloud intelligence, or AIOps.
Abstract: Foundation models (FMs) are machine learning models that are trained broadly on large-scale data and can be adapted to a set of downstream tasks via fine-tuning, few-shot learning, or even zero-shot learning. Despite the success of FMs in the language and vision domains, we have yet to see an attempt to develop FMs for cloud systems management (also known as cloud intelligence or AIOps). In this work, we explore the opportunities for developing FMs for cloud systems management. We propose an initial FM design (the FLASH framework) based on meta-learning and demonstrate its use in two tasks: resource configuration search and workload autoscaling. Preliminary results show that FLASH incurs 52.3-90.5% less performance degradation with no adaptation and adapts 5.5x faster. We conclude by discussing the unique risks and challenges of developing FMs for cloud systems management.
Submission Number: 2