Late Code Chunking: A Code Chunking Strategy for Repository-Level Code Completion

ACL ARR 2026 January Submission3138 Authors

04 Jan 2026 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · CC BY 4.0
Keywords: repository-level code completion, code completion, code generation, code chunking, retrieval-augmented code generation
Abstract: This paper introduces Late Code Chunking (LC$^2$), a chunking strategy designed to improve the semantic understanding of code segments for Large Language Models (LLMs). Repository-level code completion requires predicting the continuation of unfinished code by leveraging cross-file context spread across a repository. However, when retrieved fragments suffer from missing semantics, i.e., the loss of structural or behavioral information during chunking, LLMs struggle to interpret the target code. To address this, LC$^2$ bifurcates chunks into a "Code Retrieval Context" optimized for similarity-based search and a "Code Comprehension Context" enriched via context expansion and augmentation. This dual-context design reduces the information loss caused by chunking and enhances the ability of LLMs to utilize retrieved code. Additionally, we introduce an Asymmetric Query-Chunk Sizing strategy that further improves retrieval quality by minimizing query noise. Our experiments demonstrate that LC$^2$ provides robust performance enhancements, achieving a statistically significant 19.7\% improvement in Exact Match accuracy on the CrossCodeEval benchmark compared to the best existing chunking method.
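The abstract's dual-context design can be illustrated with a minimal sketch: each chunk keeps a compact text for similarity search (the retrieval context) alongside an expanded text, widened with surrounding lines, that is handed to the LLM (the comprehension context). The class, function, and parameter names below are hypothetical illustrations, not the paper's actual implementation:

```python
from dataclasses import dataclass

@dataclass
class DualContextChunk:
    # Compact text used for similarity-based retrieval.
    retrieval_context: str
    # Expanded text given to the LLM at generation time.
    comprehension_context: str

def make_dual_context_chunks(lines, chunk_size=4, expand=2):
    """Split source lines into fixed-size chunks; each chunk pairs a
    compact retrieval context with a comprehension context widened by
    `expand` surrounding lines on each side (a stand-in for the
    paper's context expansion and augmentation)."""
    chunks = []
    for start in range(0, len(lines), chunk_size):
        end = min(start + chunk_size, len(lines))
        lo = max(0, start - expand)
        hi = min(len(lines), end + expand)
        chunks.append(DualContextChunk(
            retrieval_context="\n".join(lines[start:end]),
            comprehension_context="\n".join(lines[lo:hi]),
        ))
    return chunks

source = [f"line {i}" for i in range(10)]
chunks = make_dual_context_chunks(source)
print(len(chunks))                                        # 3
print(len(chunks[1].retrieval_context.splitlines()))      # 4
print(len(chunks[1].comprehension_context.splitlines()))  # 8
```

The key property is the asymmetry: retrieval matches against the small, focused text, while generation sees the larger, semantically richer text, so structural information lost at chunk boundaries is restored before the LLM reads the chunk.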
Paper Type: Short
Research Area: Code Models
Research Area Keywords: code models, retrieval-augmented generation, prompting, code generation and understanding, chunking
Contribution Types: NLP engineering experiment
Languages Studied: Python, Java, C#
Submission Number: 3138