Keywords: Large Language Models, Benchmark, Test Generation, Quality Assessment
TL;DR: A benchmark for evaluating the error-detection capability of generated unit tests, with a discussion of the problem-memorization phenomenon
Abstract: Test generation is a critical component of automated code generation, yet existing benchmarks primarily evaluate generated tests using pass rates, overlooking test comprehensiveness and error-detection capabilities. We introduce TestJudge, a benchmark designed to evaluate both the quality and error-detection capabilities of generated unit tests. TestJudge contains 8,000 programming problems in Python and C++ sourced from Codeforces. For each problem, we provide 10 diverse code submissions with known correctness labels, where a generated test is considered valid only if it correctly classifies all 10 submissions according to ground-truth verdicts. Our evaluation of 13 state-of-the-art models using verdict matching rate and coverage metrics reveals significant challenges in current approaches. The best-performing model, Gemini-2.5-Pro, achieves verdict matching rates of only 59.75% for Python and 11.50% for C++. Notably, we observe a striking performance gap when comparing test generation versus direct problem-solving tasks on identical problems, with problem-solving success rates being considerably higher. This discrepancy suggests that models may rely on problem memorization rather than developing robust testing strategies, highlighting a critical limitation in current automated test generation approaches.
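The abstract states that a generated test is considered valid only if it classifies all 10 labeled submissions for a problem in agreement with the ground-truth verdicts, and that verdict matching rate is reported across problems. A minimal sketch of that validity check is shown below; it is not the authors' code, and the `run_test` helper and data layout are assumptions made purely for illustration.

```python
def is_valid_test(test, submissions, run_test):
    """Check one generated test against a problem's 10 labeled submissions.

    submissions: list of (code, ground_truth_is_correct) pairs.
    run_test(code, test): hypothetical callable returning True if the
    submission passes the generated test.
    """
    for code, is_correct in submissions:
        if run_test(code, test) != is_correct:
            return False  # a single verdict mismatch invalidates the test
    return True


def verdict_matching_rate(tests_per_problem, submissions_per_problem, run_test):
    """Fraction of problems whose generated test matches all ground-truth verdicts."""
    n_valid = sum(
        is_valid_test(test, submissions_per_problem[pid], run_test)
        for pid, test in tests_per_problem.items()
    )
    return n_valid / len(tests_per_problem)
```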
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 23657