xCodeEval: An Execution based Large Scale Multilingual Multitask Benchmark for Code Understanding, Generation, Translation and Retrieval

22 Sept 2023 (modified: 11 Feb 2024) · Submitted to ICLR 2024
Primary Area: datasets and benchmarks
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: programming-language, code, program-synthesis, automatic-code-repair, code-retrieval, code-translation, code-classification, execution, benchmark, multilingual, multitask, unit-test
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
TL;DR: We introduce xCodeEval, the largest executable multilingual multitask benchmark to date, consisting of 25M document-level coding examples from about 7.5K unique problems covering up to 11 programming languages with execution-level parallelism.
Abstract: Recently, pre-trained large language models (LLMs) have shown impressive abilities in generating code from natural language descriptions, repairing buggy code, translating code between languages, and retrieving relevant code segments. However, the evaluation of these models has often been performed in a scattered way on only one or two specific tasks, in a few languages, at a partial granularity (e.g., function level), and in many cases without proper training data. Even more concerning is that in most cases the evaluation of generated code has been done in terms of mere lexical overlap with a reference code rather than actual execution. We introduce **xCodeEval**, the largest executable multilingual multitask benchmark to date, consisting of $25$M document-level coding examples ($16.5$B tokens) from about $7.5$K unique problems covering up to $11$ programming languages with execution-level parallelism. It features a total of $7$ tasks involving code understanding, generation, translation and retrieval. **xCodeEval** adopts an execution-based evaluation and offers a multilingual code execution engine, **ExecEval**, that supports unit-test-based execution in all the $11$ languages. To address the challenge of balancing the distributions of text-code samples over multiple attributes in validation/test sets, we propose a novel data splitting and data selection schema based on the geometric mean and graph-theoretic principles. Our experiments with OpenAI's LLMs and open-source LLMs on the tasks and languages demonstrate that **xCodeEval** is quite challenging given the current advancements in language models. Both [xCodeEval](https://github.com/ntunlp/xCodeEval) and [ExecEval](https://github.com/ntunlp/ExecEval) are freely available on [Hugging Face](https://huggingface.co/datasets/NTU-NLP-sg/xCodeEval) and [GitHub](https://github.com/ntunlp/ExecEval).
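To make the execution-based evaluation protocol concrete, the minimal sketch below illustrates what unit-test-based grading of a generated solution amounts to: a candidate program passes a problem only if it produces the expected output on every unit test within a time limit. This is an illustrative assumption about the setup, not ExecEval's actual API; the candidate file path, the (stdin, expected stdout) test-case format, and the time limit are hypothetical.

```python
# Minimal sketch of unit-test-based execution evaluation (NOT ExecEval's API).
# A candidate passes only if it matches the expected output on every test case.
# The file path, test-case format, and time limit are illustrative assumptions.
import subprocess

def passes_all_tests(candidate_path: str,
                     unit_tests: list[tuple[str, str]],
                     time_limit: float = 2.0) -> bool:
    """Run a Python candidate solution on each (stdin, expected_stdout) pair."""
    for stdin_text, expected_stdout in unit_tests:
        try:
            result = subprocess.run(
                ["python3", candidate_path],
                input=stdin_text,
                capture_output=True,
                text=True,
                timeout=time_limit,  # guard against non-terminating candidates
            )
        except subprocess.TimeoutExpired:
            return False  # time-limit exceeded counts as a failed test
        if result.returncode != 0:
            return False  # runtime error
        if result.stdout.strip() != expected_stdout.strip():
            return False  # wrong answer
    return True

# Hypothetical usage: a problem with two unit tests.
# tests = [("1 2\n", "3\n"), ("10 -4\n", "6\n")]
# print(passes_all_tests("candidate_solution.py", tests))
```

Per the abstract, ExecEval generalizes this idea to all $11$ supported languages and runs evaluations with execution-level parallelism.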
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 4855