Mastering construction heuristics with self-play deep reinforcement learning

Published: 01 Jan 2023 · Last Modified: 25 Jul 2025 · Neural Comput. Appl. 2023 · CC BY-SA 4.0
Abstract: Learning heuristics that construct solutions automatically, without expert experience, has long been a central challenge in combinatorial optimization. Building an agent with the planning ability to solve multiple problems simultaneously is likewise a long-standing goal of artificial intelligence. Nonetheless, most current learning-based methods for combinatorial optimization still rely on hand-designed heuristics. In real-world problems, the environment's dynamics are often unknown and complex, making current methods difficult to generalize and deploy. Inspired by AlphaGo Zero, in this paper we propose a novel self-play reinforcement learning algorithm (CH-Zero) based on Monte Carlo tree search (MCTS) for routing optimization problems. Like AlphaGo Zero, CH-Zero requires no expert experience, only a few necessary rules. Unlike other MCTS-based self-play algorithms, however, we separate offline training from online inference. Specifically, we apply self-play reinforcement learning without MCTS to train the policy and value networks offline, and then combine the learned heuristics and neural networks with MCTS to make inferences on unseen instances. Because MCTS is not incorporated during training, this amounts to training a lightweight self-play framework whose learning efficiency is much higher than that of existing self-play-based methods for combinatorial optimization. At runtime, the learned heuristics guide MCTS to improve the policy and take better actions.
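The offline/online separation the abstract describes can be pictured with a minimal sketch: a policy/value function learned offline (here replaced by a hand-coded stand-in, since the trained networks are not part of this page) guides a PUCT-style MCTS that constructs a routing solution city by city at inference time. All names and hyperparameters below (`policy_value`, `Node`, `puct_select`, `c_puct`, `n_sims`) are hypothetical illustrations, not the paper's implementation.

```python
# Illustrative sketch of CH-Zero-style online inference: a policy/value
# function (stand-in for networks trained offline via self-play) guides
# PUCT-style MCTS to build a TSP tour one city at a time.
import math
import random

N_CITIES = 8
random.seed(0)
coords = [(random.random(), random.random()) for _ in range(N_CITIES)]

def dist(a, b):
    return math.hypot(coords[a][0] - coords[b][0], coords[a][1] - coords[b][1])

def policy_value(tour, remaining):
    """Stand-in for the trained networks: a softmax-over-inverse-distance
    prior and a greedy nearest-neighbour completion as the value estimate."""
    last = tour[-1]
    logits = {c: -dist(last, c) for c in remaining}
    m = max(logits.values())
    exp = {c: math.exp(v - m) for c, v in logits.items()}
    z = sum(exp.values())
    prior = {c: e / z for c, e in exp.items()}
    cur, rest, length = last, set(remaining), 0.0
    while rest:  # greedy rollout; value = negative completed tour length
        nxt = min(rest, key=lambda c: dist(cur, c))
        length += dist(cur, nxt)
        cur, rest = nxt, rest - {nxt}
    length += dist(cur, tour[0])
    return prior, -length

class Node:
    def __init__(self, prior):
        self.prior, self.visits, self.value_sum = prior, 0, 0.0
        self.children = {}  # action (next city) -> Node

    def q(self):
        return self.value_sum / self.visits if self.visits else 0.0

def puct_select(node, c_puct=1.5):
    """Pick the child maximizing Q + c_puct * prior * sqrt(N) / (1 + n)."""
    total = math.sqrt(node.visits + 1)
    return max(node.children.items(),
               key=lambda kv: kv[1].q() + c_puct * kv[1].prior * total / (1 + kv[1].visits))

def mcts_step(root_tour, remaining, n_sims=50):
    """Run simulations from the current partial tour; return the next city."""
    root = Node(1.0)
    prior, _ = policy_value(root_tour, remaining)
    root.children = {c: Node(p) for c, p in prior.items()}
    for _ in range(n_sims):
        node, path = root, [root]
        tour, rest = list(root_tour), set(remaining)
        while node.children:  # select down to a leaf
            city, node = puct_select(node)
            path.append(node)
            tour.append(city)
            rest.discard(city)
        if rest:  # expand the leaf with the network prior, back up its value
            p, v = policy_value(tour, rest)
            node.children = {c: Node(pr) for c, pr in p.items()}
        else:  # terminal: back up the exact (negative) tour length
            v = -(sum(dist(tour[i], tour[i + 1]) for i in range(len(tour) - 1))
                  + dist(tour[-1], tour[0]))
        for n in path:
            n.visits += 1
            n.value_sum += v
    return max(root.children, key=lambda c: root.children[c].visits)

tour, remaining = [0], set(range(1, N_CITIES))
while remaining:
    nxt = mcts_step(tour, remaining)
    tour.append(nxt)
    remaining.discard(nxt)
print("tour:", tour)
```

In this reading, the expensive tree search is paid only at inference, which is why training without MCTS yields the lightweight self-play framework the abstract claims.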