Keywords: LLMs, RLVR, code generation
TL;DR: We propose a test synthesis method to help create a large algorithmic coding dataset with high-quality tests, and show that it significantly improves LLM post-training (i.e. reinforcement learning), demonstrating the importance of test quality.
Abstract: Verifiers provide important reward signals for reinforcement learning of large language models (LLMs).
However, building reliable verifiers is challenging, especially for code generation tasks.
A well-disguised wrong solution may be detected only by carefully written, human-authored edge cases that are difficult to synthesize automatically. To address this issue, we propose HARDTESTGEN, an approach for synthesizing high-quality test cases for algorithmic coding problems. Using it, we curate HARDTESTS, a comprehensive algorithmic programming dataset with 26.6k problems and high-quality synthetic tests.
Compared with existing tests, HARDTESTGEN tests are significantly more accurate at verifying LLM-generated code (+11.22 percentage points in precision, i.e., the percentage of actually correct programs among those predicted correct). We also show that downstream post-training, including rejection sampling and reinforcement learning (RL), improves LLM code generation performance when it uses the HARDTESTS verifier. We open-source our dataset and synthesis pipeline at https://leililab.github.io/HardTests/.
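Precision here is the share of truly correct programs among those the tests accept. A minimal sketch of that computation follows; the function name and the sample labels are hypothetical illustrations, not data or code from the paper.

```python
from typing import Optional, Sequence


def verifier_precision(
    predicted_correct: Sequence[bool],
    actually_correct: Sequence[bool],
) -> Optional[float]:
    """Fraction of test-accepted programs that are truly correct.

    predicted_correct[i] is True when program i passes the test suite;
    actually_correct[i] is True when program i is correct per ground truth.
    Returns None when no program is accepted (precision is undefined).
    """
    # Keep the ground-truth label of every program the tests accepted.
    accepted = [truth for pred, truth in zip(predicted_correct, actually_correct) if pred]
    if not accepted:
        return None
    return sum(accepted) / len(accepted)


# Hypothetical labels for five candidate programs (illustrative only).
predicted = [True, True, True, False, True]
actual = [True, False, True, False, True]
precision = verifier_precision(predicted, actual)
```

In this toy sample the tests accept four programs, one of which is actually wrong; a stronger test suite would reject that program and raise precision.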
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 20770