Forge: Quality-Aware Reinforcement Learning for NP-Hard Optimization in LLMs

ACL ARR 2026 January Submission8408 Authors

06 Jan 2026 (modified: 20 Mar 2026)
License: CC BY 4.0
Keywords: Quality-Aware Reinforcement Learning, LLM reasoning, RLVR
Abstract: Large Language Models (LLMs) have achieved remarkable success on reasoning benchmarks through Reinforcement Learning with Verifiable Rewards (RLVR), excelling at tasks such as math, coding, logic, and puzzles. However, existing benchmarks evaluate only correctness, overlooking optimality: the ability to find the best solutions under constraints. We propose Forge, the first comprehensive framework for training and evaluating LLMs on NP-hard optimization problems through quality-aware RLVR. Forge provides three key components: a scalable training infrastructure with instance generators, quality verifiers, and optimal baselines across 10 tasks; a rigorous benchmark with 1,000 instances evaluating both feasibility (Success Rate) and quality (Quality Ratio); and quality-aware rewards enabling continuous improvement beyond binary correctness. Training Qwen2.5-7B-Instruct-1M on 15K examples achieves 93.1% SR and 46.6% QR, significantly outperforming GPT-4o (29.6% SR, 14.6% QR). Beyond optimization, training on Forge transfers to diverse tasks: mathematics (+2.2%), logic (+1.2%), knowledge (+4.1%), and instruction-following (+6.1%). Our analysis reveals that quality-aware rewards improve solutions by 28.8% over binary rewards, and that task diversity drives generalization more than data quantity, offering insights into RLVR scaling for complex reasoning.
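The quality-aware reward contrasted with binary correctness in the abstract can be illustrated with a minimal sketch. Everything below is an assumption for illustration (the function name, its signature, the zero reward for infeasible outputs, and the exact ratio definition); the paper's actual reward design may differ.

```python
def quality_aware_reward(
    is_feasible: bool,
    objective_value: float,
    optimal_value: float,
    minimize: bool = True,
) -> float:
    """Hypothetical quality-aware reward: 0 if the solution violates the
    instance's constraints, else a quality ratio in (0, 1] relative to the
    optimal baseline. Objective values are assumed to be positive."""
    if not is_feasible:
        return 0.0  # infeasible solutions earn nothing, as with binary RLVR
    if minimize:
        # e.g. TSP tour length: optimum / achieved -> 1.0 at the optimum
        return optimal_value / objective_value
    # e.g. knapsack value: achieved / optimum -> 1.0 at the optimum
    return objective_value / optimal_value

# Toy example: a feasible TSP tour of length 120 against an optimal tour of
# length 100 earns ~0.83 instead of a flat "correct"/"incorrect" signal.
print(quality_aware_reward(is_feasible=True, objective_value=120.0, optimal_value=100.0))
```

Under this framing, the reward is dense within the feasible region, so a policy that already produces valid solutions can keep improving toward the optimum, which is one plausible reading of the 28.8% gain over binary rewards reported in the abstract.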
Paper Type: Long
Research Area: Language Models
Research Area Keywords: Discourse, Pragmatics, and Reasoning
Contribution Types: Model analysis & interpretability, Data resources, Data analysis
Languages Studied: English
Submission Number: 8408