Bridge-Coder: Transferring Model Capabilities from High-Resource to Low-Resource Programming Language
Abstract: Most LLMs excel at generating code for high-resource programming languages (HRPLs) such as \texttt{Python}, a capability that has become standard thanks to abundant training data. However, they struggle significantly with low-resource programming languages (LRPLs) such as \texttt{D}, exacerbating the digital divide. This gap prevents developers who use LRPLs from benefiting equally and hinders innovation within underrepresented programming communities.
To make matters worse, manually generating data for LRPLs is highly labor-intensive and requires expensive expert effort.
In this work, we begin by analyzing the NL-PL Gap: LRPL data generated directly by LLMs often suffers from subpar quality due to misalignment between natural language (NL) instructions and programming language (PL) outputs. To address this issue, we introduce \textit{Bridge-Assist Generation}, a method that generates LRPL data by drawing on LLMs' general knowledge, HRPL proficiency, and in-context learning capabilities. To further maximize the utility of the generated data, we propose \textit{Bridged Alignment} to obtain \textbf{Bridge-Coder}.
To thoroughly evaluate our approach, we select four relatively low-resource programming languages: \texttt{R}, \texttt{D}, \texttt{Racket}, and \texttt{Bash}. Experimental results show that Bridge-Coder achieves significant improvements over the original model, with average gains of 18.71 and 10.81 on two comprehensive benchmarks, M-HumanEval and M-MBPP.
Paper Type: Long
Research Area: NLP Applications
Research Area Keywords: Code Generation, Multilingual Programming Language, Data Generation
Contribution Types: Approaches to low-resource settings, Publicly available software and/or pre-trained models, Data resources, Data analysis
Languages Studied: English
Submission Number: 2397