Knowledge-to-Verification: Unlocking Reinforcement Learning with Verifiable Rewards for LLMs in Knowledge-Intensive Domains

05 Sept 2025 (modified: 04 Dec 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Large Language Models, Reinforcement Learning with Verifiable Rewards, Knowledge-Intensive Domains, Reasoning
TL;DR: We propose K2V, a framework that extends RLVR to knowledge-intensive domains and enables verification of the model's reasoning process
Abstract: Reinforcement learning with verifiable rewards (RLVR) has demonstrated promising potential to enhance the reasoning capabilities of large language models in domains such as mathematics and coding. However, its application has not been effectively extended to knowledge-intensive domains, owing to unverifiable answers and the scarcity of high-quality verifiable data. Furthermore, existing RLVR methods suffer from two inherent drawbacks. First, they focus solely on the correctness of the final answer while ignoring verification of the reasoning process, which can lead to flawed reasoning. Second, this reliance on the final answer yields a sparse, binary reward signal that destabilizes training. To address these challenges, we propose Knowledge-to-Verification (K2V), a framework that extends RLVR to knowledge-intensive domains and enables verification of the model's reasoning process, without any human supervision. K2V is built on two key observations. First, structured knowledge is easier to verify than unstructured knowledge. Second, a complex reasoning process can be decomposed into a series of verifiable sub-tasks. Specifically, K2V first constructs a knowledge graph from text and models knowledge verification as a text-based knowledge graph completion task, thereby automatically synthesizing large-scale verifiable question-answering (QA) pairs. K2V then generates a checklist of sub-tasks for each QA pair; the model's reasoning process is verified by evaluating these sub-tasks, which in turn provides dense rewards. Extensive experiments demonstrate that K2V strengthens the model's fundamental reasoning skills, improving its reasoning capabilities in knowledge-intensive domains while keeping its performance in general domains stable, or even slightly improved. Our work suggests that extending RLVR to knowledge-intensive domains through automated data synthesis is a promising direction, and that verifying the reasoning process is an effective way to overcome the inherent drawbacks of RLVR. The code is available at https://anonymous.4open.science/r/k2v-C123.
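To make the two stages described in the abstract concrete, here is a minimal Python sketch of (1) turning a knowledge-graph triple into a verifiable QA pair via text-based KG completion and (2) computing a dense reward from a checklist of verified sub-tasks. All names (`synthesize_qa_from_triple`, `dense_reward`), the question template, and the mixing weight `alpha` are our own illustrative assumptions, not details taken from the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Triple:
    head: str      # e.g. "aspirin"
    relation: str  # e.g. "inhibits"
    tail: str      # e.g. "cyclooxygenase"


def synthesize_qa_from_triple(triple: Triple) -> dict:
    """Turn one KG edge into a verifiable QA pair by masking the tail
    entity, i.e. posing it as a text-based KG completion question.
    (Prompt template is a hypothetical stand-in.)"""
    question = (
        f"Which entity completes the triple "
        f"({triple.head}, {triple.relation}, ?)?"
    )
    return {"question": question, "answer": triple.tail}


def dense_reward(subtask_passed: List[bool], answer_correct: bool,
                 alpha: float = 0.5) -> float:
    """Blend checklist-based verification of the reasoning process with
    final-answer correctness, yielding a dense rather than binary reward.
    (The weighting scheme here is an assumption for illustration.)"""
    process_score = sum(subtask_passed) / max(len(subtask_passed), 1)
    return alpha * process_score + (1 - alpha) * float(answer_correct)


# Example: 3 of 4 checklist sub-tasks verified, final answer correct.
qa = synthesize_qa_from_triple(Triple("aspirin", "inhibits", "cyclooxygenase"))
print(qa["question"])
print(dense_reward([True, True, True, False], answer_correct=True))  # 0.875
```

Note how a partially correct reasoning trace (3/4 sub-tasks passed) still earns a graded reward of 0.875 instead of an all-or-nothing signal, which is the property the abstract credits with stabilizing training.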
Primary Area: foundation or frontier models, including LLMs
Submission Number: 2408