Keywords: RAG, retrieval, code search, code generation, code completion
TL;DR: We benchmark retrieval design choices for code-related tasks and provide practical recommendations for optimizing quality and efficiency in RAG systems under realistic compute budgets.
Abstract: We study retrieval design for code-focused generation tasks under realistic compute budgets. Using two complementary tasks from Long Code Arena — code completion and bug localization — we systematically compare retrieval configurations across context window sizes along three axes: (i) chunking strategy, (ii) similarity scoring, and (iii) splitting granularity. (1) For code-to-code retrieval (PL→PL), sparse BM25 with word-level splitting is the most effective and practical choice, significantly outperforming dense alternatives while running an order of magnitude faster. (2) For natural-language-to-code retrieval (NL→PL), proprietary dense encoders (the Voyage-3 family) consistently outperform sparse retrievers, albeit at roughly 100× higher latency. (3) Optimal chunk size scales with the available context: 32–64-line chunks work best at small budgets, and whole-file retrieval becomes competitive at a 16,000-token budget. (4) Simple line-based chunking matches syntax-aware splitting across budgets. (5) Retrieval latency varies by up to ~200× across configurations; BPE-based splitting is needlessly slow, and BM25 with word splitting offers the best quality–latency trade-off. Based on these findings, we provide evidence-based recommendations for building effective code-oriented RAG systems under given task requirements, model constraints, and computational budgets.
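To make the recommended PL→PL configuration concrete, the sketch below combines fixed-size line chunking, word-level splitting, and Okapi BM25 scoring in plain Python. It is a minimal illustration under stated assumptions, not the paper's implementation: the file name, `chunk_size`, top-k value, and BM25 hyperparameters (`k1`, `b`) are placeholder choices.

```python
import math
import re
from collections import Counter

def chunk_by_lines(text: str, chunk_size: int = 32) -> list[str]:
    """Split a source file into consecutive fixed-size line chunks."""
    lines = text.splitlines()
    return ["\n".join(lines[i:i + chunk_size]) for i in range(0, len(lines), chunk_size)]

def word_split(text: str) -> list[str]:
    """Word-level splitting: lowercase identifiers, keywords, and numbers."""
    return re.findall(r"[A-Za-z_][A-Za-z0-9_]*|\d+", text.lower())

def bm25_scores(query_tokens: list[str], chunk_tokens: list[list[str]],
                k1: float = 1.2, b: float = 0.75) -> list[float]:
    """Okapi BM25 score of the query against every tokenized chunk."""
    n = len(chunk_tokens)
    avgdl = sum(len(t) for t in chunk_tokens) / max(n, 1)
    df = Counter()
    for toks in chunk_tokens:
        df.update(set(toks))
    scores = []
    for toks in chunk_tokens:
        tf = Counter(toks)
        score = 0.0
        for q in query_tokens:
            if q not in tf:
                continue
            idf = math.log((n - df[q] + 0.5) / (df[q] + 0.5) + 1.0)
            denom = tf[q] + k1 * (1 - b + b * len(toks) / avgdl)
            score += idf * tf[q] * (k1 + 1) / denom
        scores.append(score)
    return scores

# Usage: retrieve the top-k repository chunks for a completion prefix.
repo_file = open("some_module.py").read()            # hypothetical repository file
chunks = chunk_by_lines(repo_file, chunk_size=32)
query = word_split("def parse_config(path):")         # completion context used as the query
ranked = sorted(zip(bm25_scores(query, [word_split(c) for c in chunks]), chunks), reverse=True)
top_k = [chunk for _, chunk in ranked[:4]]             # chunks to place in the prompt budget
```

In this setup, only the chunker and word splitter touch the code itself, so retrieval stays fast; the same chunking could be paired with a dense encoder for NL→PL queries at the cost of embedding latency.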
Submission Number: 43