Keywords: Large Language Models, Benchmark, Code Generation
Abstract: As Large Language Models (LLMs) demonstrate increasingly sophisticated code-processing capabilities, evaluating their performance on engineering-level code remains challenging. Existing repository-level benchmarks primarily focus on single scenarios, such as code generation or bug fixing, without adequately capturing the diversity and complexity of real-world software engineering workflows. Furthermore, these benchmarks suffer from limited controllability in question positioning and from reliability issues in their generated test cases. To address these limitations, we present CorePipe, a fully automated pipeline that converts repositories into comprehensive benchmark test cases, and introduce CoreCodeBench, a configurable, multi-scenario repository-level benchmark. To simulate real engineering scenarios, CorePipe generates three types of atomic questions (Development, BugFix, and Test-Driven Development) that specifically target core code segments. These atomic questions are further combined into three types of composite questions, whose difficulty can be flexibly adjusted through hyperparameter tuning. CoreCodeBench thus provides a comprehensive repository-level benchmark for investigating the applicability of LLMs to real-world engineering projects. Experiments with 16 LLMs across these diverse scenarios reveal wide variation in capability and offer multi-dimensional insights into LLM performance in engineering contexts. The code for CorePipe and the data for CoreCodeBench are publicly available.
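To make the atomic-to-composite question structure described above concrete, here is a minimal, hypothetical Python sketch. The class names, fields, and the compose function are illustrative assumptions, not the actual CorePipe implementation; it shows only the idea that atomic questions of three types are composed into composite questions, with a hyperparameter (here, the number of atoms) controlling difficulty.

    # Hypothetical sketch (not the actual CorePipe API): combining atomic
    # questions into a composite question, with a difficulty hyperparameter.
    from dataclasses import dataclass, field
    from enum import Enum
    from typing import List

    class AtomType(Enum):
        DEVELOPMENT = "development"        # implement a masked core code segment
        BUGFIX = "bugfix"                  # repair an injected defect
        TDD = "test_driven_development"    # write code satisfying given tests

    @dataclass
    class AtomicQuestion:
        repo: str            # repository the question was mined from (assumed field)
        target_file: str     # file containing the core code segment (assumed field)
        atom_type: AtomType
        prompt: str          # natural-language task description

    @dataclass
    class CompositeQuestion:
        atoms: List[AtomicQuestion] = field(default_factory=list)

        @property
        def difficulty(self) -> int:
            # Simple proxy: composing more atoms yields a harder question.
            return len(self.atoms)

    def compose(atoms: List[AtomicQuestion], num_atoms: int) -> CompositeQuestion:
        """Combine the first `num_atoms` atomic questions into one composite
        question; `num_atoms` acts as the difficulty hyperparameter."""
        return CompositeQuestion(atoms=atoms[:num_atoms])

    if __name__ == "__main__":
        atoms = [
            AtomicQuestion("example/repo", "core/utils.py", AtomType.DEVELOPMENT,
                           "Implement the masked function body."),
            AtomicQuestion("example/repo", "core/utils.py", AtomType.BUGFIX,
                           "Fix the defect injected into the loop."),
            AtomicQuestion("example/repo", "tests/test_utils.py", AtomType.TDD,
                           "Write code that passes the provided unit tests."),
        ]
        composite = compose(atoms, num_atoms=2)
        print(f"Composite question with difficulty {composite.difficulty}")

In this sketch, raising num_atoms would produce harder composite questions; the real pipeline's difficulty controls may differ.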
Croissant File: zip
Dataset URL: https://huggingface.co/collections/tubehhh/corecodebench-68256d2faabf4b1610a08caa
Code URL: https://github.com/tubehao/CoreCodeBench
Supplementary Material: pdf
Primary Area: Datasets & Benchmarks for applications in language modeling and vision language modeling
Submission Number: 858