cAST: Enhancing Code Retrieval-Augmented Generation with Structural Chunking via Abstract Syntax Tree
Abstract: Retrieval-Augmented Generation (RAG) has
become essential for large-scale code generation,
grounding predictions in external code
corpora to improve factuality. However, a critical
yet underexplored aspect of RAG pipelines
is chunking—the process of dividing documents
into retrievable units. Existing line-based
chunking heuristics often break semantic
structures, splitting functions or merging unrelated
code, which can degrade generation quality.
We propose chunking via Abstract Syntax
Trees (cAST), a structure-aware method that
recursively breaks large AST nodes into smaller
chunks and merges sibling nodes while respecting
size limits. This approach generates self-contained,
semantically coherent units across
programming languages and tasks, improving
performance on diverse code generation tasks,
e.g., boosting Recall@5 by 4.3 points on RepoEval
retrieval and Pass@1 by 2.67 points on
SWE-bench generation. Our work highlights
the importance of structure-aware chunking for
scaling retrieval-enhanced code intelligence.