Abstract: The rise of Large Language Models (LLMs) has fueled the development of coding agents designed to solve real-world code generation tasks.
SWE-Bench has become a widely used benchmark for evaluating the code generation capabilities of these agents, using real-world problems derived from GitHub issues and their corresponding pull requests.
However, the manually written test cases included in these pull requests are often insufficient, allowing some generated patches to pass the tests while failing to resolve the underlying issue.
To address this challenge, we introduce UTGenerator, an LLM-driven test case generator that automatically analyzes codebases and dependencies to generate test cases for real-world Python projects.
Building on UTGenerator, we propose UTBoost, a comprehensive framework for test case augmentation.
UTBoost augments the original test suites so that a generated patch is accepted only if it is functionally equivalent to the gold patch, i.e., it passes the same augmented test cases.
In our evaluation, we identified 36 task instances with insufficient test cases and uncovered 345 erroneous patches that were incorrectly labeled as passing in the original SWE-Bench.
These corrections significantly impact leaderboard rankings, affecting 40.9% of entries in SWE-Bench Lite and 24.4% in SWE-Bench Verified.
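A minimal sketch (not the authors' implementation) of the evaluation idea the abstract describes: a candidate patch counts as resolved only if it passes both the pull request's original tests and the augmented tests generated for the issue. All paths, file names, and the helper functions below are hypothetical placeholders.

```python
import subprocess


def run_tests(repo_dir: str, test_files: list[str]) -> bool:
    """Run the given pytest files inside the (already patched) project and report success."""
    result = subprocess.run(
        ["python", "-m", "pytest", "-q", *test_files],
        cwd=repo_dir,
        capture_output=True,
        text=True,
    )
    return result.returncode == 0


def evaluate_patch(repo_dir: str, original_tests: list[str], augmented_tests: list[str]) -> str:
    """Classify a patched repository against the original vs. augmented test suites."""
    passes_original = run_tests(repo_dir, original_tests)
    passes_augmented = run_tests(repo_dir, augmented_tests)
    if passes_original and passes_augmented:
        return "resolved"
    if passes_original and not passes_augmented:
        # The case test augmentation is meant to expose: the PR's own tests are
        # too weak to detect that the patch does not actually fix the issue.
        return "false positive under original tests"
    return "not resolved"


if __name__ == "__main__":
    verdict = evaluate_patch(
        repo_dir="workspace/project",                   # hypothetical checkout with the candidate patch applied
        original_tests=["tests/test_issue_pr.py"],      # tests shipped with the original pull request
        augmented_tests=["tests/test_issue_utgen.py"],  # tests produced by a generator such as UTGenerator
    )
    print(verdict)
```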
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: automatic evaluation of datasets, benchmarking
Contribution Types: NLP engineering experiment, Data analysis
Languages Studied: English
Submission Number: 1214