All in RLVR on Non-Verifiable Domains

Tao Tan; Howe Tissue; Lu Wang

16 Sept 2025 (modified: 05 Jan 2026) | ICLR 2026 Conference Withdrawn Submission | CC BY 4.0

Keywords: RLVR, Programmatic Rubrics, Non-verifiable, Open domain

TL;DR: We pioneer applying RLVR to open domains by using auto-generated "Judge Code" to provide sufficient, partial reward signals during RL training.

Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has demonstrated great success in verifiable domains such as math and coding for large language models (LLMs). However, open domains, with their subjectivity and lack of ground truth, have long been considered fundamentally challenging for RLVR, limiting its application. In this work, we challenge this view and pioneer a novel methodology to extend RLVR into open domains. We first show that a partial and imperfect sample-level reward is sufficient for RL under certain conditions. Building on this finding, we introduce sample-specific Judge Code as programmatic rubrics, which replaces the traditional reward model (RM) for evaluating LLM responses. Specifically, our methodology is centered on a Judge Code Generator (JCG), which programmatically translates evaluation rubrics into executable Judge Code for each sample. Judge Code serves as a partial and computationally efficient instantiation of the evaluation rubric. The system supports two operational modes: in Online mode (On-JCG), it generates (Query, Judge Code) pairs on the fly, creating a reusable dataset for subsequent RL training; in Offline mode (Off-JCG), it directly uses this pre-generated dataset to enable highly efficient, RM-free training. Through experiments, we demonstrate the promising potential of applying RLVR methods to open domains. Moreover, we emphasize one key benefit of this efficiency: compared with RM-based methods, specifically the generative reward model (GenRM), Off-JCG achieves more than a 2x speedup in wall-clock time when reaching competitive performance. This work highlights a promising direction for reshaping the understanding of RLVR and open-domain research.
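
To make the idea of sample-specific Judge Code concrete, here is a minimal Python sketch of what one (Query, Judge Code) pair might look like. Everything in it is an illustrative assumption and not taken from the paper: the query text, the three checks, the `judge` and `score_rollouts` names, and the choice to use the fraction of passed checks as the reward. It only illustrates the general shape of a partial, cheap, programmatic reward that needs no reward model at training time.

```python
from typing import Callable, List, Tuple

# Hypothetical open-domain query (an assumption for illustration only).
QUERY = "Write a three-line poem about autumn that mentions the word 'leaves'."


def judge(response: str) -> float:
    """Hypothetical Judge Code for QUERY.

    Checks only the parts of the rubric that are easy to verify in code,
    so the reward is partial and imperfect by design. Returns the fraction
    of checks passed, a value in [0, 1].
    """
    non_empty_lines = [ln for ln in response.strip().splitlines() if ln.strip()]
    checks = [
        len(non_empty_lines) == 3,        # requested number of lines
        "leaves" in response.lower(),     # required keyword is present
        len(response.split()) <= 30,      # stays reasonably short
    ]
    return sum(checks) / len(checks)


# An offline-style dataset is then just a list of (query, judge) pairs
# that can be reused across training runs (assumed structure).
DATASET: List[Tuple[str, Callable[[str], float]]] = [(QUERY, judge)]


def score_rollouts(judge_fn: Callable[[str], float], rollouts: List[str]) -> List[float]:
    """Score a group of sampled responses with the sample's own judge."""
    return [judge_fn(r) for r in rollouts]
```

In an RL loop, the scores returned by `score_rollouts` would stand in for reward-model outputs for that query. How the paper's JCG actually derives and aggregates its checks is not specified in this abstract.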

Primary Area: reinforcement learning

Submission Number: 7332
