Keywords: Offline Reinforcement Learning, ML for Systems, RL for Systems
TL;DR: Offline RL-based load control for mitigating critical system failures.
Abstract: This paper introduces a load-shedding mechanism that mitigates metastable failures through offline reinforcement learning (RL). Previous studies have heavily focused on heuristics that are reactive and limited in generalization, while online RL algorithms face challenges in accurately simulating system dynamics and acquiring data with sufficient coverage. In contrast, our algorithm leverages offline RL to learn directly from existing log data. Through extensive empirical experiments, we demonstrate that our algorithm outperforms rule-based methods and supervised learning algorithms in a proactive, adaptive, generalizable, and safe manner. Deployed in a Java compute service with diverse execution times and configurations, our algorithm exhibits faster reaction time and attains the Pareto frontier between throughput and tail latency.