Abstract: Current approaches commonly integrate repository-level code completion with retrieval-augmented generation. Specifically, private code repositories are utilized as retrieval databases, which aim to supply relevant code chunks to a large language model (LLM). However, incorporating multiple retrieved code chunks into an LLM increases the cost of inference. This not only decreases the efficiency of the LLM but also impairs the user experience. To address this, we introduce $\textbf{RepoLC}$, which uses a $\textbf{L}$ight module to $\textbf{C}$ompress the retrieved code, thereby reducing the inference cost of LLMs. We insert a Semantic Compressor Encoder (SCE) between the retriever and the generator. Specifically, the SCE compresses the retrieved code chunks into fewer high-level tokens and then projects them into the semantic space of the LLM. We propose a two-stage training scheme that trains the overall pipeline through semantic alignment followed by task alignment. Experimental results demonstrate that our approach achieves significant improvements on multiple datasets. Compared to other methods, our approach incurs minimal accuracy loss while achieving inference times nearly as fast as in-file-only completion.
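The compression step described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the actual SCE is a learned encoder trained via the two-stage alignment scheme, whereas the stand-in below simply segment-pools a chunk's token embeddings into a few high-level tokens and applies a linear projection into an assumed LLM embedding dimension. All names and shapes here are hypothetical.

```python
import numpy as np

def compress_chunk(chunk_embeddings: np.ndarray,
                   num_summary_tokens: int,
                   projection: np.ndarray) -> np.ndarray:
    """Compress a retrieved chunk's token embeddings (T, d_enc) into
    num_summary_tokens high-level tokens in the LLM space (k, d_llm).

    Hypothetical stand-in for the Semantic Compressor Encoder (SCE):
    mean-pool contiguous segments, then linearly project. The real SCE
    is a trained module, not fixed pooling.
    """
    # Split the T encoder tokens into k contiguous segments and
    # mean-pool each segment into one high-level token.
    segments = np.array_split(chunk_embeddings, num_summary_tokens, axis=0)
    pooled = np.stack([seg.mean(axis=0) for seg in segments])  # (k, d_enc)
    # Project the pooled tokens into the LLM's semantic space.
    return pooled @ projection  # (k, d_llm)

# Example: 128 encoder tokens of width 64, compressed to 8 tokens of width 32,
# so the generator sees 8 soft tokens instead of 128 raw ones.
rng = np.random.default_rng(0)
chunk = rng.normal(size=(128, 64))
proj = rng.normal(size=(64, 32))
summary = compress_chunk(chunk, num_summary_tokens=8, projection=proj)
print(summary.shape)  # (8, 32)
```

The cost saving comes from the generator attending to 8 compressed tokens per chunk rather than the full chunk, which is where the claimed inference-time reduction would originate.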
Paper Type: Long
Research Area: Generation
Research Area Keywords: Generation
Languages Studied: English
Submission Number: 1967