NP-ENGINE: EMPOWERING OPTIMIZATION REASONING IN LARGE LANGUAGE MODELS WITH VERIFIABLE SYNTHETIC NP PROBLEMS
Keywords: LLM; Reinforcement Learning with Verifiable Reward; Optimization Reasoning
Abstract: Large Language Models (LLMs) have shown strong reasoning capabilities, with models like OpenAI's O-series and DeepSeek R1 excelling at tasks such as mathematics, coding, logic, and puzzles through Reinforcement Learning with Verifiable Rewards (RLVR). However, their ability to solve more complex optimization problems—particularly NP-hard tasks—remains underexplored.
To bridge this gap, we propose \method, the first comprehensive framework for training and evaluating LLMs on NP-hard problems. \method covers 10 tasks across five domains, each equipped with (i) a controllable instance generator, (ii) a rule-based verifier, and (iii) a heuristic solver that provides near-optimal solutions as reference ground truth. This generator-verifier-heuristic pipeline enables scalable and verifiable RLVR training across hierarchical difficulty levels. We also introduce \bench, a benchmark derived from \data, specifically designed to evaluate LLMs' ability to tackle NP-hard reasoning problems, focusing not only on feasibility but also on solution quality. Additionally, we present \model, a model trained via zero-RLVR with curriculum learning on Qwen2.5-7B-Instruct, which significantly outperforms GPT-4o on \bench and achieves state-of-the-art performance among models of the same size.
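To make the generator-verifier-heuristic pipeline concrete, here is a minimal sketch (not the authors' implementation) using 0/1 knapsack as a stand-in NP-hard task; every function name and the reward shaping below are illustrative assumptions, not the paper's actual API.

```python
# Hypothetical sketch of a generator-verifier-heuristic pipeline for RLVR,
# using 0/1 knapsack as a stand-in NP-hard task. All names are illustrative.
import random

def make_instance(n_items: int, seed: int):
    """Controllable generator: difficulty scales with n_items."""
    rng = random.Random(seed)
    weights = [rng.randint(1, 20) for _ in range(n_items)]
    values = [rng.randint(1, 20) for _ in range(n_items)]
    capacity = sum(weights) // 2
    return weights, values, capacity

def verify(selection, weights, values, capacity):
    """Rule-based verifier: check feasibility and return the achieved value."""
    items = set(selection)
    if any(i < 0 or i >= len(weights) for i in items):
        return False, 0
    total_w = sum(weights[i] for i in items)
    total_v = sum(values[i] for i in items)
    return total_w <= capacity, total_v

def heuristic_value(weights, values, capacity):
    """Greedy value/weight heuristic: an approximate reference solution."""
    order = sorted(range(len(weights)),
                   key=lambda i: values[i] / weights[i], reverse=True)
    total_w = total_v = 0
    for i in order:
        if total_w + weights[i] <= capacity:
            total_w += weights[i]
            total_v += values[i]
    return total_v

def reward(selection, weights, values, capacity):
    """Verifiable reward: 0 if infeasible, else value relative to the heuristic."""
    feasible, v = verify(selection, weights, values, capacity)
    if not feasible:
        return 0.0
    return min(1.0, v / max(1, heuristic_value(weights, values, capacity)))
```

In this sketch the heuristic serves only as a normalization target for solution quality, so the reward stays verifiable without needing exact optimal solutions, which would be intractable at scale for NP-hard instances.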
Beyond in-domain tasks, we demonstrate that RLVR training on \data enables strong out-of-domain (OOD) generalization to reasoning tasks (logic, puzzles, math, and knowledge), as well as to non-reasoning tasks such as instruction following. We also observe a scaling trend: increasing task diversity improves OOD generalization. These findings suggest that task-rich RLVR training is a promising direction for advancing LLMs' reasoning ability, revealing new insights into the scaling laws of RLVR.
Primary Area: reinforcement learning
Submission Number: 9287