Late Code Chunking: A Code Chunking Strategy for Repository-Level Code Completion

ACL ARR 2026 January Submission3138 Authors

04 Jan 2026 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · CC BY 4.0
Keywords: repository-level code completion, code completion, code generation, code chunking, retrieval-augmented code generation
Abstract: This paper introduces Late Code Chunking (LC$^2$), a chunking strategy designed to improve the semantic understanding of code segments for Large Language Models (LLMs). Repository-level code completion requires predicting the continuation of unfinished code by leveraging cross-file context spread across a repository. However, when retrieved fragments suffer from missing semantics, i.e., the loss of structural or behavioral information during chunking, LLMs struggle to interpret the target code. To address this, LC$^2$ bifurcates chunks into a "Code Retrieval Context" optimized for similarity-based search and a "Code Comprehension Context" enriched via context expansion and augmentation. This dual-context design reduces the information loss caused by chunking and enhances the ability of LLMs to utilize retrieved code. Additionally, we introduce an Asymmetric Query-Chunk Sizing strategy that further improves retrieval quality by minimizing query noise. Our experiments demonstrate that LC$^2$ provides robust performance enhancements, achieving a statistically significant 19.7\% improvement in Exact Match accuracy on the CrossCodeEval benchmark compared to the best existing chunking method.
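The abstract's dual-context design can be illustrated with a minimal sketch: each chunk keeps a compact text for similarity search (the retrieval context) alongside an expanded text, widened with surrounding lines, that is handed to the LLM (the comprehension context). The class, function, and parameter names below are hypothetical illustrations, not the paper's actual implementation:

```python
from dataclasses import dataclass

@dataclass
class DualContextChunk:
    # Compact text used for similarity-based retrieval.
    retrieval_context: str
    # Expanded text given to the LLM at generation time.
    comprehension_context: str

def make_dual_context_chunks(lines, chunk_size=4, expand=2):
    """Split source lines into fixed-size chunks; each chunk pairs a
    compact retrieval context with a comprehension context widened by
    `expand` surrounding lines on each side (a stand-in for the
    paper's context expansion and augmentation)."""
    chunks = []
    for start in range(0, len(lines), chunk_size):
        end = min(start + chunk_size, len(lines))
        lo = max(0, start - expand)
        hi = min(len(lines), end + expand)
        chunks.append(DualContextChunk(
            retrieval_context="\n".join(lines[start:end]),
            comprehension_context="\n".join(lines[lo:hi]),
        ))
    return chunks

source = [f"line {i}" for i in range(10)]
chunks = make_dual_context_chunks(source)
print(len(chunks))                                        # 3
print(len(chunks[1].retrieval_context.splitlines()))      # 4
print(len(chunks[1].comprehension_context.splitlines()))  # 8
```

The key property is the asymmetry: retrieval matches against the small, focused text, while generation sees the larger, semantically richer text, so structural information lost at chunk boundaries is restored before the LLM reads the chunk.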
Paper Type: Short
Research Area: Code Models
Research Area Keywords: code models, retrieval-augmented generation, prompting, code generation and understanding, chunking
Contribution Types: NLP engineering experiment
Languages Studied: Python, Java, C#
Submission Number: 3138