HardTests: Synthesizing High-Quality Test Cases for LLM Coding

Published: 22 Sept 2025, Last Modified: 25 Nov 2025
DL4C @ NeurIPS 2025 Poster · License: CC BY 4.0
Keywords: LLMs, code generation, RLVR
TL;DR: We propose a test synthesis method and use it to create the largest competitive programming dataset with high-quality tests. We further study how test quality impacts LLM post-training.
Abstract: Verifiers play a crucial role in large language model (LLM) reasoning and are required by post-training techniques such as reinforcement learning. However, reliable verifiers are hard to obtain for difficult coding problems, because a well-disguised wrong solution may only be caught by carefully crafted, human-written edge cases that are difficult to synthesize. To address this issue, we propose HARDTESTGEN, a pipeline for high-quality test synthesis using LLMs. With this pipeline, we curate HARDTESTS, the largest competitive programming dataset, consisting of 47k problems with synthetic high-quality tests. Compared with existing tests, HARDTESTGEN tests demonstrate 11.3 percentage points higher precision and 17.5 percentage points higher recall when evaluating LLM-generated code. For harder problems, the improvement in precision can be as large as 40 percentage points. HARDTESTS also proves more effective for model training, as measured by downstream code generation performance.
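Illustration (not from the paper): the precision and recall figures compare a synthesized test suite's pass/fail verdicts against the ground-truth correctness of LLM-generated solutions. The sketch below shows one plausible way such metrics could be computed; the names `Submission` and `verifier_precision_recall`, and the toy data, are hypothetical and introduced only for illustration.

```python
from dataclasses import dataclass

@dataclass
class Submission:
    code_id: str
    is_correct: bool      # ground-truth label (e.g., from official judge verdicts)
    passes_tests: bool    # whether the synthesized test suite accepts this code

def verifier_precision_recall(submissions: list[Submission]) -> tuple[float, float]:
    """Treat 'accepted by the synthesized tests' as a positive prediction and
    'actually correct' as the positive label, then compute precision and recall."""
    tp = sum(s.passes_tests and s.is_correct for s in submissions)
    fp = sum(s.passes_tests and not s.is_correct for s in submissions)
    fn = sum(not s.passes_tests and s.is_correct for s in submissions)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Toy usage: a disguised-wrong solution that slips past weak tests lowers precision.
subs = [
    Submission("llm_sol_1", is_correct=True,  passes_tests=True),
    Submission("llm_sol_2", is_correct=False, passes_tests=True),   # missed edge case
    Submission("llm_sol_3", is_correct=False, passes_tests=False),
]
print(verifier_precision_recall(subs))  # (0.5, 1.0)
```

Under this reading, higher-quality tests raise precision by rejecting wrong-but-plausible solutions, and raise recall by not spuriously failing correct ones.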
Submission Number: 68