Characterizing Job-Task Dependency in Cloud Workloads Using Graph Learning

Published: 01 Jan 2021, Last Modified: 28 Jul 2025, Venue: IPDPS Workshops 2021, License: CC BY-SA 4.0
Abstract: Modeling and scheduling diverse and dynamic workloads effectively has become a crucial issue due to the ever-increasing scale and complexity of systems and applications in modern data centers. A large-scale cloud system comprises many computing nodes, storage nodes, and networking devices running diverse workloads. Existing work has analyzed execution traces in terms of resource usage using statistical methods. Cloud workloads, especially batch jobs, are composed of tens to thousands of tasks with complex dependencies, which can be represented by directed acyclic graphs (DAGs). These workloads and their dependencies have not been thoroughly studied. Understanding the characteristics of batch cloud workloads helps us foresee the resource demands and execution times of new jobs and make better job-scheduling decisions. In this paper, we investigate batch jobs with dependencies in production cloud computing environments from the perspective of topological characteristics and structural patterns. We design a graph learning approach for job classification based on jobs’ topological similarity. We evaluate our methods using traces collected from a production data center and discover insightful properties and patterns of batch jobs in real-world scenarios. To the best of our knowledge, this is the first work that leverages graph learning to explore the topological structures of cloud workflows for characterization and analysis.