CodeGRAG: Bridging the Gap between Natural Language and Programming Language via Graphical Retrieval Augmented Generation

ACL ARR 2024 June Submission5823 Authors

16 Jun 2024 (modified: 05 Jul 2024)ACL ARR 2024 June SubmissionEveryoneRevisionsBibTeXCC BY 4.0
Abstract: Utilizing large language models to generate codes has shown promising meaning in software development revolution. Despite the intelligence shown by the general large language models, their specificity in code generation can still be improved due to the syntactic gap and mismatched vocabulary existing among natural language and different programming languages. In this paper, we propose CodeGRAG, a Graphical Retrieval Augmented Code Generation framework to enhance the performance of LLMs. CodeGRAG builds the graphical view of code blocks based on the control flow and data flow of them to fill the gap between programming languages and natural language, which can facilitate natural language based LLMs for better understanding of code syntax and serve as a bridge among different programming languages. To take the extracted structural knowledge into the foundation models, we propose 1) a hard meta-graph prompt template to transform the challenging graphical representation into informative knowledge for tuning-free models and 2) a soft prompting technique that injects the domain knowledge of programming languages into the model parameters via finetuning the models with the help of a pretrained GNN expert model. CodeGRAG significantly improves the code generation ability of LLMs and can even offer performance gain for cross-lingual code generation.
Paper Type: Long
Research Area: NLP Applications
Research Area Keywords: Code Generation, Retrieval Augemented Generation
Contribution Types: NLP engineering experiment
Languages Studied: C++, Python
Submission Number: 5823
Loading