Keywords: llm, code, test generation
Abstract: As Large Language Models (LLMs) are increasingly applied to automate software development, their use for automatic test case generation has become a key area of research. However, existing benchmarks for evaluating LLMs fundamentally simplify the real-world testing challenge. They typically constrain the problem to either (1) reproducing known bugs at the repository level, or (2) generating tests for isolated code units, such as individual functions, detached from their broader project context. Both approaches fail to assess a crucial capability of LLMs: proactive, exploratory testing in projects characterized by complex, cross-file dependencies.
To address this critical gap, we introduce TestExplora, the first systematic benchmark designed to evaluate the proactive defect discovery capabilities of LLMs at the repository level. Constructed from real-world pull requests, TestExplora challenges models to find bugs without any prior knowledge of how those bugs manifest. Our comprehensive evaluation, conducted in both black-box and white-box settings, reveals a stark capability gap. Even state-of-the-art models exhibit critically low success rates (e.g., GPT-5-mini: 17.56%, o3-mini: 5.23%), and access to the full source code (white-box) yields only marginal improvement. Further analysis reveals that existing models struggle mainly with assertion mismatches and misconfigured mocks. TestExplora thus establishes a principled foundation for advancing research towards the grand challenge of autonomous, repository-level defect discovery.
Primary Area: datasets and benchmarks
Submission Number: 16944