CoreCodeBench: Decoupling Code Intelligence via Fine-Grained Repository-Level Tasks

CoreCodeBench: Decoupling Code Intelligence via Fine-Grained Repository-Level Tasks

ACL ARR 2026 January Submission5215 Authors

05 Jan 2026 (modified: 20 Mar 2026)ACL ARR 2026 January SubmissionEveryoneRevisionsBibTeXCC BY 4.0

Keywords: Code Generation, Evaluation, Large Language Models

Abstract: The evaluation of Large Language Models (LLMs) for software engineering has shifted towards complex, repository-level tasks. However, existing benchmarks predominantly rely on coarse-grained pass rates that treat programming proficiency as a monolithic capability, obscuring specific cognitive bottlenecks. Furthermore, the static nature of these benchmarks renders them vulnerable to data contamination and performance saturation. To address these limitations, we introduce CoreCodeBench, a configurable repository-level benchmark designed to dissect coding capabilities through atomized tasks. Leveraging our automated framework, CorePipe, we extract and transform Python repositories into a comprehensive suite of tasks that isolate distinct cognitive demands within identical code contexts. Unlike static evaluations, CoreCodeBench supports controllable difficulty scaling to prevent saturation and ensures superior data quality. It achieves a 78.55% validity yield, significantly surpassing the 31.7% retention rate of SWE-bench-Verified. Extensive experiments with state-of-the-art LLMs reveal a significant capability misalignment, evidenced by distinct ranking shifts across cognitive dimensions. This indicates that coding proficiency is non-monolithic, as strength in one aspect does not necessarily translate to others. These findings underscore the necessity of our fine-grained taxonomy in diagnosing model deficiencies and offer a sustainable, rigorous framework for evolving code intelligence. Code of CorePipe framework and data of CoreCodeBench are available in https://anonymous.4open.science/r/CoreCodeBench-64E2/.

Paper Type: Long

Research Area: Resources and Evaluation

Research Area Keywords: benchmarking,evaluation,NLP datasets,corpus creation,automatic creation and evaluation of language resources

Contribution Types: Model analysis & interpretability, Data resources, Data analysis

Languages Studied: English

Submission Number: 5215

Loading