Code2JSON: Can a Zero-Shot LLM Extract Code Features for Code RAG?

Published: 05 Mar 2025, Last Modified: 31 Jan 2026 · ICLR 2025 Workshop DL4C · CC BY 4.0
Abstract: A code retrieval-augmented generation (RAG) framework that accepts natural language (NL) queries and generates responses from relevant code contexts is crucial for enhancing developer productivity. However, building a code RAG system is inherently challenging due to the hierarchical structure and complex semantics of source code, especially on resource-constrained infrastructure. To address this, we introduce CODE2JSON, a zero-shot technique that leverages LLMs to extract NL representations from code via semantic parsing. CODE2JSON serves as a programming language (PL)-agnostic feature extractor. We evaluate CODE2JSON on six programming languages (Python, Ruby, C++, Go, Java, and JavaScript) using approximately 125K records from eight widely used benchmark datasets, including HumanEval-X, MBPP, COIR, DS-1000, CSN, and ODEX. We examine the performance of CODE2JSON in different RAG setups for code retrieval and code generation from NL queries. We explore nine retrieval models, encompassing sparse retrieval (e.g., BM25), text embeddings (e.g., BGE-Large), and code embeddings (e.g., CodeBERT), along with three LLMs: DeepSeekCoder-7B, Llama-3-8B, and Phi-2. Our findings indicate that CODE2JSON-assisted RAG outperforms the baseline in more than 50% of code retrieval and code generation tasks.
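The core idea of CODE2JSON — prompting a zero-shot LLM to emit a structured NL description of a code snippet, then parsing that JSON into retrievable features — can be sketched as follows. This is a minimal illustration only: the prompt wording, the feature schema (`summary`, `inputs`, `outputs`, `algorithm`), and the helper names are assumptions for this sketch, not the paper's actual prompt or schema, and the LLM call is mocked with a canned response.

```python
import json

# Assumed feature schema for illustration; the paper's schema may differ.
FEATURE_KEYS = ["summary", "inputs", "outputs", "algorithm"]

def build_prompt(code: str) -> str:
    """Build a PL-agnostic zero-shot prompt asking for JSON features of the code."""
    schema = ", ".join(f'"{k}"' for k in FEATURE_KEYS)
    return (
        f"Describe the following code as a JSON object with keys {schema}. "
        f"Respond with JSON only.\n\n```\n{code}\n```"
    )

def parse_features(llm_response: str) -> dict:
    """Extract the JSON object from a (possibly code-fenced) LLM response."""
    text = llm_response.strip()
    # Locate the outermost JSON object, tolerating surrounding fences/prose.
    start, end = text.find("{"), text.rfind("}")
    return json.loads(text[start:end + 1])

# Mock model response standing in for a real LLM call (e.g., DeepSeekCoder-7B).
mock_response = (
    '```json\n'
    '{"summary": "adds two numbers", "inputs": ["a", "b"],'
    ' "outputs": ["sum"], "algorithm": "arithmetic"}\n'
    '```'
)
features = parse_features(mock_response)
# The resulting NL features can then be indexed by any retriever
# (e.g., BM25 over the "summary" field) in place of raw source code.
```

In a full pipeline, `parse_features` output would be embedded or indexed so that NL queries match against the extracted descriptions rather than raw code tokens, which is what makes the extractor PL-agnostic.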