Abstract: Implementing new features in repository-level codebases is a crucial application of code generation models. However, current benchmarks lack a dedicated evaluation framework for this capability. To fill this gap, we introduce FEA-Bench, a benchmark designed to assess the ability of large language models (LLMs) to perform incremental development within code repositories. We collect pull requests from 83 GitHub repositories and use rule-based and intent-based filtering to construct task instances focused on new feature development. Each task instance contains code changes and is paired with relevant unit test files, so that its solution can be verified. Feature implementation requires LLMs to possess both code completion capabilities for new components and code editing abilities for other relevant parts of the code repository, providing a more comprehensive evaluation of LLMs' automated software engineering capabilities. Experimental results show that LLMs perform significantly worse on FEA-Bench, highlighting considerable challenges in such repository-level incremental code development.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: benchmarking, code generation and understanding
Contribution Types: Data resources
Languages Studied: English
Submission Number: 6451