Abstract: The rise of Large Language Models (LLMs) has fueled the development of coding agents designed to solve real-world code generation tasks.
SWE-Bench has become a widely used benchmark for evaluating the code generation capabilities of these agents, using real-world problems derived from GitHub issues and their corresponding pull requests.
However, the manually written test cases included in these pull requests are often insufficient, allowing some generated patches to pass the tests while failing to resolve the underlying issue.
To address this challenge, we introduce UTGenerator, an LLM-driven test case generator that automatically analyzes codebases and dependencies to generate test cases for real-world Python projects.
Building on UTGenerator, we propose UTBoost, a comprehensive framework for test case augmentation.
UTBoost augments the original test suites so that a generated patch is accepted only if it is functionally equivalent to the gold patch, i.e., it passes the same augmented test cases.
In our evaluation, we identified 36 task instances with insufficient test cases and uncovered 345 erroneous patches that were incorrectly labeled as passing in the original SWE-Bench.
These corrections significantly impact leaderboard rankings, affecting 40.9% of entries in SWE-Bench Lite and 24.4% in SWE-Bench Verified.
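A minimal sketch (not the authors' implementation) of the evaluation idea the abstract describes: a candidate patch counts as resolved only if it passes both the pull request's original tests and the augmented tests generated for the issue. All paths, file names, and the helper functions below are hypothetical placeholders.

```python
import subprocess


def run_tests(repo_dir: str, test_files: list[str]) -> bool:
    """Run the given pytest files inside the (already patched) project and report success."""
    result = subprocess.run(
        ["python", "-m", "pytest", "-q", *test_files],
        cwd=repo_dir,
        capture_output=True,
        text=True,
    )
    return result.returncode == 0


def evaluate_patch(repo_dir: str, original_tests: list[str], augmented_tests: list[str]) -> str:
    """Classify a patched repository against the original vs. augmented test suites."""
    passes_original = run_tests(repo_dir, original_tests)
    passes_augmented = run_tests(repo_dir, augmented_tests)
    if passes_original and passes_augmented:
        return "resolved"
    if passes_original and not passes_augmented:
        # The case test augmentation is meant to expose: the PR's own tests are
        # too weak to detect that the patch does not actually fix the issue.
        return "false positive under original tests"
    return "not resolved"


if __name__ == "__main__":
    verdict = evaluate_patch(
        repo_dir="workspace/project",                   # hypothetical checkout with the candidate patch applied
        original_tests=["tests/test_issue_pr.py"],      # tests shipped with the original pull request
        augmented_tests=["tests/test_issue_utgen.py"],  # tests produced by a generator such as UTGenerator
    )
    print(verdict)
```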
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: automatic evaluation of datasets, benchmarking
Contribution Types: NLP engineering experiment, Data analysis
Languages Studied: English
Submission Number: 1214