RepoFixEval: A Repository-Level Program Repair Benchmark From Issue Discovering to Bug Fixing

Tao Sun; Yang Yang; Xianfu Cheng; Jian Yang; Yintong Huo; Zhuoren Ye; Rubing Yang; Xiangyuan Guan; Wei Zhang; Hangyuan Ji; Changyu Ren; Mengdi Zhang; Xunliang Cai; Zhoujun Li

26 Sept 2024 (modified: 15 Nov 2024) | ICLR 2025 Conference Withdrawn Submission | CC BY 4.0

Keywords: Code, Bug fix, Program Repair, LLM

TL;DR: We present RepoFixEval, an issue-aware benchmark for practical repository-level automatic program repair, and evaluate 16 LLMs' performance in fixing multi-file software bugs.

Abstract: Automatic Program Repair (APR) aims to automatically fix software bugs, playing an essential role in software development. While current research demonstrates that Large Language Models (LLMs) excel in file-level program repair, their effectiveness in repository-level program repair remains unexplored. Real-world software projects, which often consist of multiple files, present significant challenges for LLMs in identifying bugs and generating fixes due to their intricate project structures. To bridge this gap, we introduce RepoFixEval, a repository-level APR benchmark consisting of 160 real-world bug-fixing suites from popular Python projects. RepoFixEval provides the original buggy programs, associated issue reports, corresponding fixes, and unit tests to verify the correctness of each fix. Based on the benchmark, we further propose a three-step evaluation framework for LLM-based APR tools, encompassing (1) discovering issues from execution failures, (2) localizing buggy code segments, and (3) generating code fixes. Experimental results show that LLMs struggle with organizing error messages during the issue discovery phase. We find that longer contexts positively affect performance, but only a few LLMs can effectively utilize extended context information at the 128K level. Some open-source LLMs are competitive with their closed-source counterparts, yet even the best-performing model, GPT-4o, resolves only 12.3% of bugs. Our study reveals the capabilities and limitations of 16 LLMs in handling repository-level bugs, providing valuable insights for future research in this field.

Supplementary Material: zip

Primary Area: datasets and benchmarks

Submission Number: 7160
