Keywords: Large Language Models, Benchmark, Test Generation, Quality Assessment
TL;DR: A benchmark for evaluating the error-detection capability of generated unit tests, with a discussion of the problem-memorization phenomenon
Abstract: Test generation is a critical component of automated code generation, yet existing benchmarks primarily evaluate generated tests using pass rates, overlooking test comprehensiveness and error-detection capabilities. We introduce TestJudge, a benchmark designed to evaluate both the quality and error-detection capabilities of generated unit tests. TestJudge contains 8,000 programming problems in Python and C++ sourced from Codeforces. For each problem, we provide 10 diverse code submissions with known correctness labels, where a generated test is considered valid only if it correctly classifies all 10 submissions according to ground-truth verdicts. Our evaluation of 13 state-of-the-art models using verdict matching rate and coverage metrics reveals significant challenges in current approaches. The best-performing model, Gemini-2.5-Pro, achieves verdict matching rates of only 59.75% for Python and 11.50% for C++. Notably, we observe a striking performance gap when comparing test generation versus direct problem-solving tasks on identical problems, with problem-solving success rates being considerably higher. This discrepancy suggests that models may rely on problem memorization rather than developing robust testing strategies, highlighting a critical limitation in current automated test generation approaches.
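The abstract states that a generated test is considered valid only if it classifies all 10 labeled submissions for a problem in agreement with the ground-truth verdicts, and that verdict matching rate is reported across problems. A minimal sketch of that validity check is shown below; it is not the authors' code, and the `run_test` helper and data layout are assumptions made purely for illustration.

```python
def is_valid_test(test, submissions, run_test):
    """Check one generated test against a problem's 10 labeled submissions.

    submissions: list of (code, ground_truth_is_correct) pairs.
    run_test(code, test): hypothetical callable returning True if the
    submission passes the generated test.
    """
    for code, is_correct in submissions:
        if run_test(code, test) != is_correct:
            return False  # a single verdict mismatch invalidates the test
    return True


def verdict_matching_rate(tests_per_problem, submissions_per_problem, run_test):
    """Fraction of problems whose generated test matches all ground-truth verdicts."""
    n_valid = sum(
        is_valid_test(test, submissions_per_problem[pid], run_test)
        for pid, test in tests_per_problem.items()
    )
    return n_valid / len(tests_per_problem)
```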
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 23657