HARDTESTGEN: A High-Quality RL Verifier Generation Pipeline for LLM Algorithmic Coding

ICLR 2026 Conference Submission 20770 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: LLMs, RLVR, code generation
TL;DR: We propose a test synthesis method to help create a large algorithmic coding dataset with high-quality tests, and show that it significantly improves LLM post-training (i.e. reinforcement learning), demonstrating the importance of test quality.
Abstract: Verifiers provide important reward signals for reinforcement learning of large language models (LLMs). However, reliable verifiers are challenging to create, especially for code generation tasks: a well-disguised wrong program may be caught only by carefully crafted, human-written edge cases that are difficult to synthesize automatically. To address this issue, we propose HardTestsGen, an approach for synthesizing high-quality test cases for algorithmic coding problems. Using it, we curate HardTests, a comprehensive algorithmic programming dataset with 26.6k problems and high-quality synthetic tests. Compared with existing tests, HardTestsGen tests verify LLM-generated code significantly more accurately (+11.22 percentage points in precision, i.e., the fraction of code predicted correct that is actually correct). We also show that downstream post-training with the HardTests verifier, including rejection sampling and reinforcement learning (RL), improves the code generation performance of LLMs.
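To make the verifier's role concrete, below is a minimal sketch (not the paper's actual implementation) of how a test-based verifier can turn synthesized test cases into a reward signal for rejection sampling or RL: a candidate program is run on each (input, expected output) pair and receives a binary reward only if every test passes. All names and the subprocess-based execution strategy are illustrative assumptions.

```python
import subprocess

def verify(candidate_source: str, tests: list[tuple[str, str]],
           timeout_s: float = 2.0) -> float:
    """Return 1.0 if the candidate program passes every test case, else 0.0.

    Illustrative sketch only: each test is a (stdin_text, expected_stdout) pair,
    and the candidate is assumed to be a stdin/stdout Python program.
    """
    for stdin_text, expected in tests:
        try:
            result = subprocess.run(
                ["python3", "-c", candidate_source],
                input=stdin_text,
                capture_output=True,
                text=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return 0.0  # time limit exceeded counts as a failed test
        if result.returncode != 0:
            return 0.0  # runtime error
        if result.stdout.strip() != expected.strip():
            return 0.0  # wrong answer on this test case
    return 1.0  # all tests passed: positive reward for post-training
```

Under this view, the precision reported above measures how often a program that receives reward 1.0 from the verifier is genuinely correct; weak tests inflate rewards for wrong programs, which is exactly the failure mode higher-quality synthesized tests are meant to reduce.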
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 20770