Attentive Reinforcement Learning for Scheduling Problem with Node Auto-scaling

Yanxia Guan, Min Xie, Yuan Li, Xinhai Xu

2022 (modified: 18 Apr 2023), SMC 2022

Abstract: The distributed system Ray has attracted much attention in decision-making applications because it can greatly accelerate the training of intelligent algorithms. Task scheduling is one of the critical technologies in Ray, where the number of resource nodes can auto-scale, i.e., increase or decrease automatically according to the workload. The scheduling strategy Ray currently adopts is simple, which leaves much room for optimization. In this paper, we design a reinforcement learning method to optimize the scheduling problem in Ray. We propose an attentive reinforcement learning method with an attention-based state encoder that efficiently extracts the system state when the number of resource nodes varies. In addition, an action mask mechanism filters out invalid actions. Further, to improve learning efficiency in an environment with a varying number of nodes, we design a curriculum learning method that trains the model by gradually increasing the number of nodes in the scheduling process. Finally, we evaluate the method on the simulation platform CloudSim using real data from the Alibaba Cluster Trace Program. The experimental results show that the proposed method effectively reduces task completion time compared to the original algorithm in Ray.
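
The two mechanisms named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the two-dimensional node features, and the fixed query vector are all assumptions made for illustration. It shows how attention pooling yields a fixed-size state vector regardless of how many nodes exist, and how an action mask assigns zero probability to invalid nodes.

```python
import math


def softmax(scores):
    """Numerically stable softmax; entries equal to -inf receive probability 0."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(node_features, query):
    """Pool a variable-length list of node feature vectors into one vector.

    Hypothetical helper: scores are scaled dot products between a query
    vector and each node's features, so the output size depends only on
    the feature dimension, not on the number of nodes.
    """
    dim = len(query)
    scores = [
        sum(q * x for q, x in zip(query, feat)) / math.sqrt(dim)
        for feat in node_features
    ]
    weights = softmax(scores)
    return [
        sum(w * feat[i] for w, feat in zip(weights, node_features))
        for i in range(dim)
    ]


def masked_policy(logits, valid):
    """Softmax over per-node logits with invalid actions masked out."""
    if not any(valid):
        raise ValueError("at least one action must be valid")
    masked = [l if ok else float("-inf") for l, ok in zip(logits, valid)]
    return softmax(masked)


# Assumed features per node: (free CPU fraction, free memory fraction).
nodes = [[0.2, 0.9], [0.8, 0.1], [0.5, 0.5]]
state = attention_pool(nodes, query=[1.0, 1.0])

# The second node is treated as invalid (e.g. it cannot fit the task).
probs = masked_policy([1.0, 2.0, 0.5], [True, False, True])
```

In a learned model the query and the logits would come from trained network layers; here they are fixed constants so the sketch stays self-contained.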
