Keywords: Learning from Noise, Reinforcement Learning, Weakly Supervised Learning, Post-Training
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) trains policies against automated verifiers to avoid costly human labeling. To reduce vulnerability to verifier hacking, many RLVR systems collapse rewards to binary $\{0,1\}$ values during training. This choice carries a cost: it introduces *false negatives* (rejecting correct answers, FNs) and *false positives* (accepting incorrect ones, FPs). For instance, a rule-based checker may mark the correct fraction $\tfrac{12}{36}$ as wrong when compared against the canonical $\tfrac{1}{3}$ due to brittle parsing/equivalence rules (FN), while large language model (LLM) judges can be gamed by superficial cues or even a single adversarial token, yielding inflated correctness for wrong solutions (FP). We formalize verifier unreliability by modeling the verifier as a stochastic reward channel with asymmetric noise rates. From this abstraction, we derive two correction algorithms for verifier errors. The first is a *backward* correction that de-biases the observed binary reward to recover an *unbiased* estimator of the clean policy gradient. The second is a *forward* correction that reweights score-function terms so that the expected update direction aligns with the *clean gradient*; notably, it requires only the FN rate. We implement both as lightweight hooks in a group relative policy optimization (GRPO)-based RLVR pipeline and evaluate them on math-reasoning models and benchmarks. Across models and datasets, both corrections improve over uncorrected training; the forward variant converges faster and remains stable under heavier noise. Finally, we show a practical appeal mechanism in which a lightweight LLM verifier estimates the FN rate online by rechecking rule-based negatives, outperforming other state-of-the-art contenders.
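The backward correction described in the abstract admits a compact illustration. Below is a minimal sketch, not the paper's implementation: the function names, the rate arguments `fn_rate`/`fp_rate`, and the group-standardized advantage are assumptions made for illustration, assuming the noise rates are known and the observed reward is binary.

```python
# Minimal sketch (hypothetical, not the authors' code): de-bias a {0,1}
# verifier reward under an asymmetric noise channel, then form a
# GRPO-style group-relative advantage from the corrected rewards.
import numpy as np


def backward_corrected_reward(observed: np.ndarray, fn_rate: float, fp_rate: float) -> np.ndarray:
    """De-bias observed {0,1} verifier rewards.

    With P(observe 0 | answer correct) = fn_rate and
    P(observe 1 | answer incorrect) = fp_rate, the estimator
        r_hat = (r_obs - fp_rate) / (1 - fn_rate - fp_rate)
    satisfies E[r_hat | true reward r] = r, so the resulting
    policy-gradient estimate remains unbiased in expectation.
    """
    denom = 1.0 - fn_rate - fp_rate
    assert denom > 0, "noise rates must satisfy fn_rate + fp_rate < 1"
    return (observed - fp_rate) / denom


def group_relative_advantages(rewards: np.ndarray) -> np.ndarray:
    """GRPO-style advantage: standardize rewards within a sampled group."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)


if __name__ == "__main__":
    # A group of 8 sampled completions scored by a noisy rule-based verifier.
    observed = np.array([1, 0, 0, 1, 0, 1, 0, 0], dtype=float)
    corrected = backward_corrected_reward(observed, fn_rate=0.2, fp_rate=0.05)
    print(group_relative_advantages(corrected))
```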
Supplementary Material: pdf
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 23032