Abstract: The use of deep reinforcement learning (DRL) techniques to solve classical combinatorial optimization problems like the Traveling Salesman Problem (TSP) has garnered considerable attention due to the advantage of flexible and fast model-based inference. However, DRL training often suffers from low efficiency and poor scalability, which hinders model generalization. This paper proposes a simple yet effective pre-training method that uses behavior cloning to initialize neural network parameters for policy gradient DRL. To alleviate the need for large numbers of demonstrations in behavior cloning, we exploit the symmetry of TSP solutions for data augmentation. Our method is demonstrated by enhancing the state-of-the-art policy gradient models, the Attention Model and POMO, for the TSP. Experimental results show that the optimality gap of the solutions is significantly reduced while DRL training time is greatly shortened. The method also enables effective and efficient solving of larger TSP instances.
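To make the symmetry-based augmentation concrete: a TSP tour's cost is invariant under cyclic rotation of the start city and reversal of the traversal direction, so a single optimal tour of N cities yields 2N equivalent demonstrations. The sketch below illustrates this idea only; the function name `augment_tour` is hypothetical and not taken from the paper's implementation.

```python
import numpy as np

def augment_tour(tour: np.ndarray) -> list[np.ndarray]:
    """Generate the 2N symmetric variants of a TSP tour of length N.

    A tour is an ordering of city indices; its cost is invariant under
    cyclic rotation (any city may serve as the start) and reversal
    (the tour may be traversed in either direction).
    """
    n = len(tour)
    variants = []
    for direction in (tour, tour[::-1]):   # forward and reversed traversal
        for shift in range(n):             # every choice of start city
            variants.append(np.roll(direction, shift))
    return variants

# Example: one optimal 4-city tour yields 8 equivalent demonstrations
# usable as behavior-cloning targets.
demos = augment_tour(np.array([0, 2, 3, 1]))
print(len(demos))  # 8
```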