RLConfig: Run-time Configuration of Cluster Schedulers via Deep Reinforcement Learning

Published: 01 Jan 2021, Last Modified: 24 May 2024, Venue: ISPA/BDCloud/SocialCom/SustainCom 2021, License: CC BY-SA 4.0
Abstract: Cluster schedulers provide a flexible resource-sharing mechanism for short-running jobs, which are time-sensitive and make up the majority of cloud jobs. A scheduler's configuration determines how resources are allocated among jobs, so it is important to job performance. Static, manually set configurations struggle to optimize performance for the diverse and changing jobs in the cloud. In this paper, we propose RLConfig, a Deep Reinforcement Learning (DRL)-based run-time configuration tuning framework that automatically configures cluster schedulers according to changing workloads and resource status. It consists of two parts: an estimator that evaluates static configurations by modeling the relationship between configuration and job performance, and a DRL-based optimizer that selects a configuration by considering performance-influencing factors such as workload status and available resources. We implemented RLConfig on the YARN Capacity Scheduler and validated its effectiveness with workloads derived from real jobs. The experimental results show that our framework reduces the average job latency by one times compared with the static configuration and by 27.8% compared with the Queue model, while its time cost is significantly lower than that of the existing run-time configuration tuning framework.
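The estimator-plus-optimizer loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: a tabular epsilon-greedy learner stands in for the DRL optimizer, a toy latency formula stands in for the estimator, and every name here (`CONFIGS`, `observe_state`, `simulated_latency`, `EpsilonGreedyTuner`) is hypothetical. No YARN API is involved.

```python
import random

# Hypothetical candidate configurations: capacity share (%) of two queues,
# (short-job queue, long-job queue). Assumed for illustration only.
CONFIGS = [(30, 70), (50, 50), (70, 30)]


def observe_state(pending_short, pending_long):
    """Discretize workload status into a coarse state label (assumed feature)."""
    return "short-heavy" if pending_short >= pending_long else "long-heavy"


def simulated_latency(state, config):
    """Toy stand-in for the estimator: the busier queue's latency
    falls as its capacity share grows."""
    share = config[0] if state == "short-heavy" else config[1]
    return 100.0 / share


class EpsilonGreedyTuner:
    """Tabular value learner used here in place of a deep RL agent."""

    def __init__(self, n_actions, epsilon=0.1, lr=0.2, seed=0):
        self.q = {}  # state -> list of estimated rewards, one per action
        self.n_actions = n_actions
        self.epsilon = epsilon
        self.lr = lr
        self.rng = random.Random(seed)

    def select(self, state):
        values = self.q.setdefault(state, [0.0] * self.n_actions)
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(self.n_actions)  # explore
        return max(range(self.n_actions), key=values.__getitem__)  # exploit

    def update(self, state, action, reward):
        values = self.q.setdefault(state, [0.0] * self.n_actions)
        values[action] += self.lr * (reward - values[action])


def train(steps=2000, seed=0):
    """Run the tuning loop: observe state, pick a config, learn from latency."""
    rng = random.Random(seed)
    tuner = EpsilonGreedyTuner(len(CONFIGS), seed=seed)
    for _ in range(steps):
        state = observe_state(rng.randint(0, 10), rng.randint(0, 10))
        action = tuner.select(state)
        reward = -simulated_latency(state, CONFIGS[action])  # lower latency is better
        tuner.update(state, action, reward)
    return tuner


def best_config(tuner, state):
    """Return the configuration with the highest learned value for a state."""
    values = tuner.q[state]
    return CONFIGS[max(range(len(CONFIGS)), key=values.__getitem__)]
```

In this toy setting, calling `train()` and then `best_config(tuner, "short-heavy")` selects the configuration that gives the short-job queue the largest share. A real deployment would replace the state features, the reward signal, and the learner with those described in the paper.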