How Many Code and Test Cases Are Enough? Evaluating Test Case Generation from a Binary-Matrix Perspective
Keywords: Code LLMs; Benchmark; Evaluation; Test Case
Abstract: Code evaluation and reinforcement learning rely critically on test cases. However, collecting golden test cases is hard and expensive, motivating the use of LLMs for automatic test case generation.
This, in turn, raises a pivotal challenge: how can we rigorously evaluate the quality of the generated test cases?
Existing benchmarks often evaluate the exclusion ratio on large, unstructured collections of wrong codes, leading to high computational costs and severe score inflation.
Furthermore, they inadvertently reward generators that detect common, trivial bugs, while failing to penalize their inability to identify rare yet critical faults.
In this work, we connect two fundamental questions: (1) What is the minimal set of wrong codes sufficient to represent the entire error space? and (2) What is the minimal set of test cases needed to distinguish them?
We introduce a novel framework that formalizes benchmark construction as finding an optimal diagnostic basis in a binary code-test matrix, where rows correspond to wrong codes, columns to test cases, and each entry records whether a test excludes a code.
The rank of this matrix plays a dual role: it specifies the minimal number of independent error patterns, which determines how many wrong codes to retain, and it provides a tight upper bound on the number of test cases required for complete fault coverage.
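The abstract does not state over which field the rank is taken; as a minimal sketch, assuming rank over GF(2) and a toy matrix, the number of independent error patterns could be computed as follows (`gf2_rank` and the example matrix are illustrative, not from the paper):

```python
import numpy as np

def gf2_rank(M: np.ndarray) -> int:
    """Rank of a binary matrix over GF(2) via row reduction on bitmasks.

    Rows are wrong codes, columns are test cases; entry (i, j) = 1
    iff test j excludes (fails) wrong code i.
    """
    rows = [int("".join(str(b) for b in row), 2) for row in M.astype(int)]
    rank = 0
    while rows:
        row = rows.pop()
        if row == 0:
            continue  # row already reduced to a combination of earlier rows
        rank += 1
        msb = row.bit_length() - 1
        # Eliminate the pivot bit from every remaining row.
        rows = [r ^ row if (r >> msb) & 1 else r for r in rows]
    return rank

# Toy code-test matrix: 4 wrong codes x 3 test cases.
M = np.array([[1, 0, 0],
              [0, 1, 0],
              [1, 1, 0],   # XOR of the first two rows: not independent
              [0, 0, 1]])
print(gf2_rank(M))  # -> 3 independent error patterns
```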
Our objective is to identify a basis of size equal to the matrix rank that maximizes internal diversity, i.e., minimizes the average pairwise Jaccard similarity of the codes' failure signatures (the matrix rows).
To tackle this NP-hard problem, we propose WrongSelect, an efficient approximation algorithm combining pre-filtering and random-restart local search to select maximally diverse wrong codes.
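The abstract names only the two ingredients of WrongSelect (pre-filtering and random-restart local search); the sketch below shows what the local-search step could look like under the minimize-average-pairwise-Jaccard objective. `wrong_select`, `signatures`, `restarts`, and `iters` are hypothetical names and parameters, not the paper's implementation:

```python
import random
from itertools import combinations

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two failure signatures (sets of failed tests)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def avg_pairwise_sim(signatures, subset) -> float:
    """Average pairwise Jaccard similarity within a subset (needs |subset| >= 2)."""
    pairs = list(combinations(subset, 2))
    return sum(jaccard(signatures[i], signatures[j]) for i, j in pairs) / len(pairs)

def wrong_select(signatures, k, restarts=20, iters=500, seed=0):
    """Random-restart local search: pick k wrong codes whose failure
    signatures have minimal average pairwise Jaccard similarity."""
    rng = random.Random(seed)
    n = len(signatures)
    best, best_score = None, float("inf")
    for _ in range(restarts):
        subset = set(rng.sample(range(n), k))
        score = avg_pairwise_sim(signatures, subset)
        for _ in range(iters):
            # Propose a single-element swap: one code out, one code in.
            out = rng.choice(tuple(subset))
            inn = rng.choice([i for i in range(n) if i not in subset])
            cand = (subset - {out}) | {inn}
            cand_score = avg_pairwise_sim(signatures, cand)
            if cand_score < score:  # accept only improving swaps
                subset, score = cand, cand_score
        if score < best_score:
            best, best_score = subset, score
    return sorted(best), best_score

# Toy usage: five wrong codes, signatures over four tests.
sigs = [{0, 1}, {1, 2}, {0, 2}, {0, 1, 2}, {3}]
print(wrong_select(sigs, k=3))
```

Each restart hill-climbs by single-element swaps and keeps the best subset found overall; the pre-filtering stage (not shown) would shrink the candidate pool before this step.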
Applying this framework to millions of competitive programming submissions, we construct TC-Bench, a compact, diverse, and inflation-resistant benchmark. Extensive experiments show that even the most advanced test case generation methods achieve only ~60% exclusion rates on TC-Bench, exposing a significant gap in their diagnostic power and highlighting substantial room for future improvement.
Primary Area: datasets and benchmarks
Submission Number: 16317