Keywords: Large Language Models, Combinatorial Optimization, Preference Learning, Process Reward
TL;DR: NaDRO employs preference-based and context-based rewards to effectively train LLMs on noisy data, enabling even smaller models to achieve superior performance on complex decision-making tasks.
Abstract: Group Relative Policy Optimization (GRPO) fine-tuning has demonstrated significant improvements on reasoning tasks. However, it often relies on high-quality labeled datasets, which are typically difficult to obtain. To address this challenge, we introduce \textbf{N}oise-\textbf{A}ware \textbf{D}ual-\textbf{R}eward \textbf{O}ptimization (\textbf{NaDRO}), which effectively enhances the training of Large Language Models (LLMs) under noisy or ambiguous supervision. NaDRO operates through two key components: \textbf{(1) Preference-based Outcome Reward (POR)}, which makes a principled bias-variance tradeoff, reducing training variance by learning from robust preference rankings instead of overfitting to single-best estimates; and \textbf{(2) Context Perception Reward (CPR)}, which ensures that LLMs conduct the necessary qualitative assessment of the current problem state to foster deeper situational understanding prior to decision-making. To validate our approach in a realistic decision-making testbed, we model classic combinatorial optimization problems, such as the Traveling Salesman Problem (TSP) and the Capacitated Vehicle Routing Problem (CVRP), as Markov Decision Processes, generating training data via cost-limited exploration. Our results demonstrate that the fine-tuned Qwen 7B and Llama 3.1-8B models achieve statistically robust performance, significantly outperforming leading LLM baselines and standard fine-tuning methods on these complex benchmarks. Code is released at \url{https://github.com/microsoft/HeurAgenix/tree/NaDRO}.
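For intuition, the sketch below illustrates how a preference-based outcome reward and a context-perception reward could be combined into a GRPO-style group-relative advantage. This is not the paper's implementation: the function names, the `<analysis>` tag convention for detecting a qualitative state assessment, and the mixing weight `alpha` are all illustrative assumptions.

```python
# Minimal sketch of a NaDRO-style dual reward (assumptions noted in comments;
# the paper's exact formulation may differ).
import re
import numpy as np

def por_reward(chosen_action, preference_ranking):
    """Preference-based Outcome Reward (illustrative): score an action by its
    position in a (possibly noisy) preference ranking over candidate actions,
    instead of matching a single 'best' action exactly."""
    if chosen_action not in preference_ranking:
        return 0.0
    rank = preference_ranking.index(chosen_action)  # 0 = most preferred
    return 1.0 - rank / max(len(preference_ranking) - 1, 1)

def cpr_reward(response_text):
    """Context Perception Reward (illustrative): reward responses that contain
    a qualitative assessment of the current state before the decision,
    approximated here by an <analysis>...</analysis> span."""
    return 1.0 if re.search(r"<analysis>.+?</analysis>", response_text, re.S) else 0.0

def group_relative_advantages(responses, actions, ranking, alpha=0.5):
    """GRPO-style advantages: combine the two rewards per rollout (weight
    `alpha` is an assumed hyperparameter), then normalize within the group."""
    rewards = np.array([
        alpha * por_reward(a, ranking) + (1 - alpha) * cpr_reward(r)
        for r, a in zip(responses, actions)
    ])
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

# Example: a group of three rollouts for one TSP decision step
ranking = ["city_4", "city_2", "city_7"]  # noisy preference over next cities
responses = [
    "<analysis>The eastern cluster is dense; visit it first.</analysis> Move to city_4.",
    "Move to city_7.",
    "<analysis>Remaining tour budget is tight.</analysis> Move to city_2.",
]
actions = ["city_4", "city_7", "city_2"]
print(group_relative_advantages(responses, actions, ranking))
```

In this sketch, rollouts that both pick a highly ranked action and articulate a state assessment receive the largest group-relative advantage, mirroring the abstract's claim that NaDRO rewards situational understanding alongside preference-consistent outcomes.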
Primary Area: Optimization (e.g., convex and non-convex, stochastic, robust)
Submission Number: 1115