Bridge-Coder: Transferring Model Capabilities from High-Resource to Low-Resource Programming Language

ACL ARR 2024 December Submission2397 Authors

16 Dec 2024 (modified: 13 Feb 2025), ACL ARR 2024 December Submission, CC BY 4.0

Abstract: Most LLMs excel at generating code for high-resource programming languages (HRPLs) such as Python, a capability that has become standard thanks to abundant training data. However, they struggle significantly with low-resource programming languages (LRPLs) such as D, which widens the digital divide. This gap prevents developers who use LRPLs from benefiting equally and hinders innovation within underrepresented programming communities. To make matters worse, manually creating data for LRPLs is highly labor-intensive and requires costly expert effort. In this work, we begin by analyzing the NL-PL Gap: LRPL data generated directly by LLMs often suffers from subpar quality because of the misalignment between natural language (NL) instructions and programming language (PL) outputs. To address this issue, we introduce Bridge-Assist Generation, a method that generates LRPL data by leveraging an LLM's general knowledge, HRPL proficiency, and in-context learning capabilities. To further maximize the utility of the generated data, we propose Bridged Alignment, with which we obtain Bridge-Coder. To thoroughly evaluate our approach, we select four relatively low-resource programming languages: R, D, Racket, and Bash. Experimental results show that Bridge-Coder achieves significant improvements over the original model, with average gains of 18.71 and 10.81 on two comprehensive benchmarks, M-HumanEval and M-MBPP.
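The abstract does not spell out how Bridge-Assist Generation is implemented, so the following is only a minimal sketch of one plausible reading: the model first solves the task in an HRPL (Python here), and that solution is then placed in context as a "bridge" when asking for the LRPL answer. The function names, prompt wording, and the `llm` callable are all hypothetical and are not taken from the paper.

```python
from typing import Callable, Dict


def build_bridge_prompt(instruction: str, bridge_code: str,
                        target_lang: str, bridge_lang: str = "Python") -> str:
    """Compose a prompt that shows an HRPL solution as in-context help.

    The wording is an illustrative assumption, not the authors' prompt.
    """
    return (
        f"Task: {instruction}\n\n"
        f"Reference solution in {bridge_lang}:\n{bridge_code}\n\n"
        f"Using the reference as a guide, write an equivalent solution in {target_lang}."
    )


def bridge_assist_generate(instruction: str, target_lang: str,
                           llm: Callable[[str], str],
                           bridge_lang: str = "Python") -> Dict[str, str]:
    """Two-step generation: HRPL answer first, then the LRPL answer.

    `llm` is any function mapping a prompt string to a completion string
    (hypothetical stand-in for a real model call).
    """
    # Step 1: rely on the model's HRPL proficiency to get a bridge solution.
    bridge_code = llm(f"Write a {bridge_lang} solution for this task:\n{instruction}")
    # Step 2: use the bridge as in-context guidance for the target LRPL.
    target_code = llm(build_bridge_prompt(instruction, bridge_code,
                                          target_lang, bridge_lang))
    return {"instruction": instruction, "bridge": bridge_code, "response": target_code}
```

With any prompt-to-text callable passed as `llm`, the returned dictionary pairs the natural language instruction with the LRPL response, which is the shape an instruction-tuning dataset would typically need. How the paper filters the generated data or uses it in Bridged Alignment is not covered by this sketch.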

Paper Type: Long

Research Area: NLP Applications

Research Area Keywords: Code Generation, Multilingual Programming Language, Data Generation

Contribution Types: Approaches to low-resource settings, Publicly available software and/or pre-trained models, Data resources, Data analysis

Languages Studied: English

Submission Number: 2397
