Keywords: LLM, GRPO, MCTS, Finetune
TL;DR: NaDRO employs preference-based and context-based rewards to train LLMs effectively on noisy data, enabling even smaller models to achieve superior performance on complex decision-making tasks.
Abstract: Group Relative Policy Optimization (GRPO) fine-tuning has been empirically shown to significantly enhance the reasoning abilities of language models. However, it often relies on large-scale, high-quality labeled data, which is typically difficult to obtain. To address this challenge, we introduce Noise-Aware Dual-Reward Optimization (NaDRO), which effectively enhances LLM training in environments where data is noisy or imperfect. NaDRO operates through two key components: \textbf{(1) Preference-based Outcome Reward (POR)}, which extracts reliable preference signals from noisy data, guiding LLMs toward more effective decisions instead of relying on specific noisy scores; and \textbf{(2) a Context Perception Reward (CPR) mechanism}, which ensures that LLMs conduct the necessary qualitative assessment of the current problem state, rewarding accurate judgments to foster better cognitive understanding before decision-making. We designed experiments in the context of combinatorial optimization problems, where dynamically selecting heuristic algorithms is challenging due to large problem scales and the difficulty of obtaining accurate decision data. Our results indicate that the fine-tuned Qwen 7B and Llama 3-8B models outperform mainstream large language models (LLMs) on this task. Code is released at \url{https://anonymous.4open.science/r/NaDRO-D34D}
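To make the dual-reward idea concrete, below is a minimal Python sketch of how a POR-style preference signal and a CPR-style context-assessment signal might be combined into a single scalar reward for GRPO-style fine-tuning. The function names, the exact-match check for the state assessment, and the mixing weights are illustrative assumptions, not the authors' implementation (see the released code for the actual method).

```python
# Minimal sketch (assumptions, not the authors' code): combining a
# preference-based outcome reward (POR) with a context perception
# reward (CPR) into one scalar used by a GRPO-style trainer.

def preference_outcome_reward(chosen_score: float, rejected_score: float) -> float:
    """POR sketch: reward the relative ordering of two noisy outcome scores
    within a sampled group, rather than trusting their absolute values."""
    return 1.0 if chosen_score > rejected_score else 0.0


def context_perception_reward(predicted_state: str, reference_state: str) -> float:
    """CPR sketch: reward a correct qualitative assessment of the current
    problem state made before the decision (exact match used for brevity)."""
    return 1.0 if predicted_state.strip().lower() == reference_state.strip().lower() else 0.0


def nadro_reward(chosen_score: float, rejected_score: float,
                 predicted_state: str, reference_state: str,
                 w_por: float = 0.7, w_cpr: float = 0.3) -> float:
    """Weighted sum of the two signals; the weights here are illustrative."""
    return (w_por * preference_outcome_reward(chosen_score, rejected_score)
            + w_cpr * context_perception_reward(predicted_state, reference_state))


# Example usage with hypothetical values:
print(nadro_reward(chosen_score=0.62, rejected_score=0.55,
                   predicted_state="sparse graph, high degree variance",
                   reference_state="sparse graph, high degree variance"))
```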
Primary Area: Optimization (e.g., convex and non-convex, stochastic, robust)
Submission Number: 1115