Track: long paper (up to 9 pages)
Keywords: code RAG, LLM, feature extraction, retrieval
TL;DR: Semantic code parsing for high-accuracy code RAG
Abstract: A code retrieval-augmented generation (RAG) framework that accepts natural language (NL) queries and generates responses from relevant code contexts is crucial for enhancing developer productivity. However, building a code RAG system is inherently challenging due to the hierarchical structure and complex semantics of source code, especially on resource-constrained infrastructure. To address this, we introduce CODE2JSON, a zero-shot technique that leverages LLMs to extract NL representations from code via semantic parsing. CODE2JSON serves as a programming language (PL)-agnostic feature extractor. We evaluate CODE2JSON on six programming languages—Python, Ruby, C++, Go, Java, and JavaScript—using approximately 125K records from eight widely used benchmark datasets, including HumanEval-X, MBPP, COIR, DS-1000, CSN, and ODEX. We examine the performance of CODE2JSON in different RAG setups for code retrieval and code generation tasks from NL queries. We explore nine retrieval models, encompassing sparse retrieval (e.g., BM25), text embeddings (e.g., BGE-Large), and code embeddings (e.g., CodeBERT), along with three LLMs: DeepSeekCoder-7B, Llama-3-8B, and Phi-2. Our findings indicate that CODE2JSON-assisted RAG outperforms the baseline approach in more than 50% of code retrieval and code generation tasks.
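The extraction-and-indexing idea in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the prompt wording, the JSON schema (keys such as "summary", "inputs", "outputs", "steps"), and the stubbed model response are all hypothetical stand-ins for the zero-shot LLM call.

```python
import json

# Hypothetical prompt template for a CODE2JSON-style extractor; the
# actual schema and wording used by the paper are not specified here.
PROMPT = (
    "Parse the following function and return JSON with keys "
    '"summary", "inputs", "outputs", and "steps":\n\n{code}'
)

def build_prompt(code: str) -> str:
    """Wrap a code snippet in the extraction prompt sent to the LLM."""
    return PROMPT.format(code=code)

def to_retrieval_doc(llm_json: str) -> str:
    """Flatten the LLM's JSON output into one NL string for indexing
    by a sparse or dense retriever (e.g., BM25 or a text embedder)."""
    record = json.loads(llm_json)
    parts = [record.get("summary", "")]
    parts += record.get("steps", [])
    parts += [f"input: {i}" for i in record.get("inputs", [])]
    parts += [f"output: {o}" for o in record.get("outputs", [])]
    return " ".join(p for p in parts if p)

# Stubbed response standing in for the zero-shot model call.
fake_response = json.dumps({
    "summary": "Compute the factorial of n.",
    "inputs": ["n: a non-negative integer"],
    "outputs": ["the factorial of n"],
    "steps": ["multiply the integers from 1 to n"],
})
doc = to_retrieval_doc(fake_response)
```

Because the extractor's output is plain NL text, the same pipeline applies unchanged across languages, which is what makes the feature extraction PL-agnostic.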
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Submission Number: 31