Generalizing Beyond Suboptimality: Offline Reinforcement Learning Learns Effective Scheduling through Random Solutions

ICLR 2026 Conference Submission 7987 Authors

16 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Flexible Job-shop Scheduling, Job-shop Scheduling, Offline Reinforcement Learning
TL;DR: We present Conservative Discrete Quantile Actor-Critic (CDQAC), an offline RL method that learns state-of-the-art constructive heuristics for FJSP and JSP, trained solely on random data.
Abstract: The Job Shop Scheduling Problem (JSP) and the Flexible Job Shop Scheduling Problem (FJSP) are combinatorial optimization problems with wide-ranging applications in industrial operations. In recent years, many online reinforcement learning (RL) approaches have been proposed to learn constructive heuristics for JSP and FJSP. Although effective, these online RL methods require millions of interactions with simulated environments, and their random policy initialization leads to poor sample efficiency. To address these limitations, we introduce Conservative Discrete Quantile Actor-Critic (CDQAC), a novel offline RL algorithm that learns effective scheduling policies directly from datasets, eliminating the need for training in a simulated environment while still improving upon suboptimal training data. CDQAC couples a quantile-based critic with a delayed policy update, estimating the return distribution of each machine–operation pair rather than selecting pairs outright. Our extensive experiments demonstrate CDQAC's remarkable ability to learn from diverse data sources. CDQAC consistently outperforms the original data-generating heuristics and surpasses state-of-the-art offline and online RL baselines. In addition, CDQAC is highly sample efficient, requiring only 10–20 training instances to learn high-quality policies. Notably, CDQAC performs best when trained on datasets generated by a random heuristic, leveraging their wider coverage of the state space to surpass policies trained on datasets generated by significantly stronger heuristics.
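The mechanism described in the abstract (a quantile critic that scores each feasible machine–operation pair, regularized conservatively for the offline setting) can be illustrated with a minimal PyTorch sketch. This is not the authors' implementation: the feature encoder, network sizes, quantile count, and the CQL-style penalty form are assumptions made purely for illustration.

    # Minimal sketch, assuming pair features are already encoded into vectors.
    import torch
    import torch.nn as nn

    class QuantileCritic(nn.Module):
        """Maps the feature vector of one machine-operation pair to N quantiles
        of its return distribution (rather than a single Q-value)."""
        def __init__(self, feat_dim: int = 64, n_quantiles: int = 32):
            super().__init__()
            self.n_quantiles = n_quantiles
            self.net = nn.Sequential(
                nn.Linear(feat_dim, 128), nn.ReLU(),
                nn.Linear(128, n_quantiles),
            )

        def forward(self, pair_feats: torch.Tensor) -> torch.Tensor:
            # pair_feats: (num_pairs, feat_dim) -> (num_pairs, n_quantiles)
            return self.net(pair_feats)

    def quantile_huber_loss(pred, target, taus, kappa=1.0):
        """Standard quantile-regression Huber loss (as in QR-DQN).
        pred: (B, N) predicted quantiles, target: (B, N') target quantiles,
        taus: (N,) quantile fractions for the predictions."""
        td = target.unsqueeze(1) - pred.unsqueeze(2)           # (B, N, N')
        huber = torch.where(td.abs() <= kappa,
                            0.5 * td.pow(2),
                            kappa * (td.abs() - 0.5 * kappa))
        weight = (taus.view(1, -1, 1) - (td.detach() < 0).float()).abs()
        return (weight * huber / kappa).mean()

    def conservative_penalty(q_all_pairs, data_pair_idx):
        """CQL-style regularizer (assumed form): push down Q-values of all
        candidate pairs while pushing up the pair chosen in the dataset.
        q_all_pairs: (num_pairs,) mean-over-quantiles values for one state."""
        return torch.logsumexp(q_all_pairs, dim=0) - q_all_pairs[data_pair_idx]

In this sketch the policy would be derived from the critic, e.g. by ranking feasible pairs by their mean quantile value, and updated less frequently than the critic, consistent with the delayed policy update mentioned in the abstract.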
Supplementary Material: zip
Primary Area: optimization
Submission Number: 7987