TinyV: Reducing False Negatives in Verification Improves RL for LLM Reasoning

TMLR Paper7235 Authors

29 Jan 2026 (modified: 06 Feb 2026) · Under review for TMLR · CC BY 4.0
Abstract: Reinforcement Learning (RL) has become a powerful tool for enhancing the reasoning abilities of large language models (LLMs) by optimizing their policies with reward signals. Yet RL's success hinges on the reliability of those rewards, which are provided by verifiers. In this paper, we expose and analyze a widespread problem: false negatives, where a verifier wrongly rejects a correct model output. Our in-depth study of the Big-Math-RL-Verified dataset reveals that over 38% of model-generated responses suffer from false negatives. We show, both empirically and theoretically, that these false negatives severely impair RL training by depriving the model of informative gradient signals and slowing convergence. To mitigate this, we propose TinyV, a lightweight LLM-based verifier that augments rule-based methods with a compact LLM to dynamically detect false negatives and recover valid trajectories, thereby producing more accurate reward signals during RL training. Across multiple math-reasoning benchmarks, integrating TinyV improves final model performance by up to 10% and reaches the rule-based verifier's peak performance in fewer than 50% of the training steps. Our findings highlight the critical importance of addressing verifier false negatives and offer a practical approach to improving RL tuning of LLMs.
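The abstract describes TinyV as a rule-based verifier augmented with a compact LLM fallback. The sketch below illustrates that general architecture under stated assumptions; it is not the paper's implementation. The names `rule_based_match`, `hybrid_verify`, and `llm_judge` are hypothetical, the string normalization is a placeholder for a real rule-based checker, and the toy judge stands in for a prompted compact LLM.

```python
# Minimal sketch of a hybrid verifier in the spirit of TinyV.
# Assumptions: the rule-based check is a normalized string match, and
# `llm_judge` is a user-supplied callable wrapping a compact LLM; neither
# reflects the paper's exact implementation.
from typing import Callable


def rule_based_match(prediction: str, reference: str) -> bool:
    """Cheap deterministic check: normalized string equality."""
    norm = lambda s: s.strip().lower().replace(" ", "")
    return norm(prediction) == norm(reference)


def hybrid_verify(prediction: str, reference: str,
                  llm_judge: Callable[[str, str], bool]) -> bool:
    """Return a binary reward, escalating to the LLM only on rule failure.

    The rule-based verifier handles the common case cheaply; the LLM is
    consulted only when the rule says "wrong", since that is where false
    negatives arise (a rule-based "correct" is never overridden).
    """
    if rule_based_match(prediction, reference):
        return True
    # Potential false negative: ask the LLM whether the two answers are
    # equivalent despite the surface mismatch.
    return llm_judge(prediction, reference)


if __name__ == "__main__":
    # Trivial stand-in judge; a real setup would prompt a small LLM.
    judge = lambda p, r: p.replace("1/2", "0.5") == r
    print(hybrid_verify("0.5", "0.5", judge))  # True via the rule check
    print(hybrid_verify("1/2", "0.5", judge))  # True via the LLM fallback
```

Invoking the LLM only on rule-based rejections keeps verification cheap: the expensive model runs exactly on the subset of trajectories where false negatives can occur.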
Submission Type: Regular submission (no more than 12 pages of main content)
Previous TMLR Submission Url: https://openreview.net/forum?id=scsSbLLwAK
Changes Since Last Submission: Fixed font and format
Assigned Action Editor: ~Masashi_Sugiyama1
Submission Number: 7235