RepoDistill: Distilling Repository Knowledge through Compression-Aware Budget Allocation and Policy Optimization

ACL ARR 2026 January Submission1858 Authors

31 Dec 2025 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · CC BY 4.0
Keywords: Large Language Model, Code Repository, Long-Code Understanding
Abstract: Large Language Models (LLMs) have achieved strong performance on many code-related tasks, yet they still struggle with repository-level scenarios where reasoning depends on long, noisy, and structurally complex contexts. While existing retrieval methods, including both similarity-based and graph-based approaches, can identify relevant code snippets, they often retrieve excessive contexts that intensify the "lost-in-the-middle" phenomenon and dilute model attention with redundant contexts. To address this, we present RepoDistill, a novel framework that integrates retrieval with learned budget allocation for fine-grained context compression. RepoDistill first employs a plug-and-play lightweight GraphRAG to retrieve context that follows logical flows. It then applies Compression-Aware Budget Allocation guided by Compression-Aware Policy Optimization, which formulates context management as a multi-step decision problem and learns allocation policies for contexts. Experiments show that RepoDistill outperforms baselines, achieving gains of up to +7.00 on SWE-QA, +24.4% on CoderEval, and +0.25 on LongCodeU. Furthermore, a compact 4B-parameter model trained with RepoDistill can serve as an effective context compressor for closed-source LLMs, reducing input tokens by up to 66\% while maintaining comparable performance. We release our code at https://anonymous.4open.science/r/RepoDistill-12B0.
Paper Type: Long
Research Area: Code Models
Research Area Keywords: NLP Applications, Language Modeling
Contribution Types: Publicly available software and/or pre-trained models
Languages Studied: Python, Java, English
Submission Number: 1858