Multi-lingual Evaluation of Code Generation Models

Ben Athiwaratkun; Sanjay Krishna Gouda; Zijian Wang; Xiaopeng Li; Yuchen Tian; Ming Tan; Wasi Uddin Ahmad; Shiqi Wang; Qing Sun; Mingyue Shang; Sujan Kumar Gonugondla; Hantian Ding; Varun Kumar; Nathan Fulton; Arash Farahani; Siddhartha Jain; Robert Giaquinto; Haifeng Qian; Murali Krishna Ramanathan; Ramesh Nallapati; Baishakhi Ray; Parminder Bhatia; Sudipta Sengupta; Dan Roth; Bing Xiang

Published: 01 Feb 2023; Last Modified: 26 May 2025; ICLR 2023 notable top 25%; Readers: Everyone

Keywords: code generation, execution-based evaluation, test-based evaluation, language models, multi-lingual code generation benchmark, code insertion, code summarization, robustness for code, code translation, zero-shot code translation, multi-lingual, mono-lingual

Abstract: We present two new benchmarks, MBXP and Multilingual HumanEval, designed to evaluate code completion models in over 10 programming languages. These datasets are generated using a conversion framework that transpiles prompts and test cases from the original MBPP and HumanEval datasets into the corresponding data in the target language. Using these benchmarks, we assess the performance of code generation models in a multi-lingual fashion and observe the generalization ability of language models on out-of-domain languages, the advantages of multi-lingual models over mono-lingual ones, the ability of few-shot prompting to teach the model new languages, and zero-shot translation abilities. In addition, we use our code generation model to perform large-scale bootstrapping to obtain synthetic canonical solutions in several languages, which can be used for other code-related evaluations such as code insertion, robustness, or summarization tasks.
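
To make the idea of transpiling test cases concrete, the following is a minimal sketch, not the authors' conversion framework. It assumes an MBPP-style Python test of the form `assert f(args) == expected` with literal arguments, and rewrites it as a JavaScript `console.assert` line. The helper names `to_js_literal` and `convert_assert_to_js`, the choice of JavaScript as the target, and the `JSON.stringify` comparison are all illustrative assumptions.

```python
import ast
import json


def to_js_literal(node: ast.AST) -> str:
    """Render a Python literal AST node as a JavaScript literal.

    Hypothetical helper: only JSON-compatible values (numbers, strings,
    booleans, None, lists, string-keyed dicts) are handled in this sketch.
    """
    value = ast.literal_eval(node)
    return json.dumps(value)


def convert_assert_to_js(py_assert: str) -> str:
    """Convert `assert f(a, b) == expected` into a JavaScript assertion string.

    Hypothetical helper for illustration; it supports a single equality
    comparison whose left side is a plain function call.
    """
    stmt = ast.parse(py_assert).body[0]
    if not isinstance(stmt, ast.Assert) or not isinstance(stmt.test, ast.Compare):
        raise ValueError("expected a statement of the form: assert f(...) == value")
    compare = stmt.test
    if len(compare.ops) != 1 or not isinstance(compare.ops[0], ast.Eq):
        raise ValueError("only a single == comparison is supported")
    call = compare.left
    if not isinstance(call, ast.Call) or not isinstance(call.func, ast.Name):
        raise ValueError("left-hand side must be a plain function call")

    args = ", ".join(to_js_literal(arg) for arg in call.args)
    expected = to_js_literal(compare.comparators[0])
    actual = f"{call.func.id}({args})"
    # Compare serialized values so that arrays are checked by content.
    return f"console.assert(JSON.stringify({actual}) === JSON.stringify({expected}));"


if __name__ == "__main__":
    print(convert_assert_to_js("assert add(1, 2) == 3"))
    print(convert_assert_to_js("assert rev([1, 2]) == [2, 1]"))
```

A real converter would also need to translate the function signature and docstring in the prompt, map names to the target language's conventions, and handle types with no JSON equivalent; none of that is attempted here.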

Please Choose The Closest Area That Your Submission Falls Into: Deep Learning and representational learning
