TL;DR: We transform numerical rewards into pairwise preference signals via a theoretically grounded reparameterization and integrate local search into fine-tuning, empirically yielding faster convergence and higher-quality solutions on COPs such as TSP, CVRP, and scheduling.
Abstract: Reinforcement Learning (RL) has emerged as a powerful tool for neural combinatorial optimization, enabling models to learn heuristics that solve complex problems without requiring expert knowledge. Despite significant progress, existing RL approaches face challenges such as diminishing reward signals and inefficient exploration in vast combinatorial action spaces, which limit training efficiency. In this paper, we propose **Preference Optimization**, a novel method that transforms quantitative reward signals into qualitative preference signals via statistical comparison modeling, emphasizing the relative superiority among sampled solutions. Methodologically, by reparameterizing the reward function in terms of the policy and utilizing preference models, we formulate an entropy-regularized RL objective that aligns the policy directly with preferences while avoiding intractable computations. Furthermore, we integrate local search techniques into fine-tuning, rather than applying them as post-processing, to generate high-quality preference pairs and help the policy escape local optima. Empirical results on various benchmarks, such as the Traveling Salesman Problem (TSP), the Capacitated Vehicle Routing Problem (CVRP), and the Flexible Flow Shop Problem (FFSP), demonstrate that our method significantly outperforms existing RL algorithms, achieving superior convergence efficiency and solution quality.
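For a concrete picture of the core idea, the sketch below shows, under simplifying assumptions, how numerical rewards over a pair of sampled solutions can be converted into a binary preference and used in a Bradley-Terry/DPO-style loss on the solutions' log-probabilities under the policy. The function names, the `beta` temperature, and the toy tensors are illustrative, not the authors' reference implementation; in the paper, one side of the pair can additionally be refined by local search during fine-tuning.

```python
import torch
import torch.nn.functional as F

def rewards_to_preference(r_a, r_b):
    """Convert numerical rewards into a qualitative preference label:
    1.0 if solution A is preferred (higher reward), else 0.0."""
    return (r_a > r_b).float()

def preference_loss(logp_a, logp_b, pref_a, beta=0.1):
    """Bradley-Terry / DPO-style loss on a pair of sampled solutions.

    logp_a, logp_b : summed log-probabilities of solutions A and B under the policy.
    pref_a         : 1.0 if A is preferred, 0.0 if B is preferred.
    beta           : temperature of the entropy-regularized objective (illustrative).
    """
    margin = beta * (logp_a - logp_b)
    # log sigmoid(margin) when A is preferred, log sigmoid(-margin) otherwise
    return -(pref_a * F.logsigmoid(margin) + (1 - pref_a) * F.logsigmoid(-margin)).mean()

# Toy usage: two sampled tours with rewards given by negative tour length.
logp_a = torch.tensor([-12.3], requires_grad=True)    # log-prob of tour A under the policy
logp_b = torch.tensor([-11.8], requires_grad=True)    # log-prob of tour B under the policy
r_a, r_b = torch.tensor([-5.1]), torch.tensor([-5.7])  # tour A is shorter, hence preferred

loss = preference_loss(logp_a, logp_b, rewards_to_preference(r_a, r_b))
loss.backward()  # gradients push probability mass toward the preferred tour
```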
Lay Summary: Many real-world tasks, like planning delivery routes or scheduling jobs, require finding the best arrangement among an enormous number of feasible solutions. Traditional reinforcement learning methods struggle here: they receive ever-smaller numerical rewards and spend excessive effort exploring a nearly infinite action space.
We introduce Preference Optimization, which compares pairs of candidate solutions to turn raw scores into simple preference signals. By aligning the learning process with these qualitative preferences and integrating local improvements into training rather than applying them afterward, our method keeps learning stable and guides the system more directly toward high-quality solutions.
This approach makes neural solvers learn faster, escape local optima, and ultimately find significantly better solutions to these complex problems more efficiently.
Link To Code: https://github.com/MingjunPan/Preference-Optimization-for-Combinatorial-Optimization-Problems
Primary Area: Optimization->Discrete and Combinatorial Optimization
Keywords: Reinforcement Learning, Combinatorial Optimization, Preference-Based Reinforcement Learning
Submission Number: 2452