EffiBench-X: A Multi-Language Benchmark for Measuring Efficiency of LLM-Generated Code

Published: 18 Sept 2025, Last Modified: 30 Oct 2025NeurIPS 2025 Datasets and Benchmarks Track posterEveryoneRevisionsBibTeXCC BY 4.0
Keywords: Code Generation, Efficiency
TL;DR: We propose EffiBench-X, a multi-language code efficiency benchmark, to address the gap in existing benchmarks primarily focusing on a single programming language (e.g., Python).
Abstract: Existing code generation benchmarks primarily evaluate functional correctness, with limited attention to code efficiency, and they are often restricted to a single language such as Python. To address this gap, we introduce EffiBench‑X, the first large‑scale multi‑language benchmark specifically designed for robust efficiency evaluation of LLM‑generated code. EffiBench‑X supports Python, C++, Java, JavaScript, Ruby, and Go, and comprises competitive programming tasks paired with human‑expert solutions as efficiency baselines. Evaluating state‑of‑the‑art LLMs on EffiBench‑X reveals that while models frequently generate functionally correct code, they consistently underperform human experts in efficiency. Even the most efficient LLM‑generated solutions (e.g., Qwen3‑32B) achieve only around 62% of human efficiency on average, with significant language‑specific variation: models tend to perform better in Python, Ruby, and JavaScript than in Java, C++, and Go (e.g., DeepSeek‑R1’s Python code is markedly more efficient than its Java code). These findings highlight the need for research into optimization‑oriented methods to improve the efficiency of LLM‑generated code across diverse languages. The dataset and evaluation infrastructure are publicly available at https://github.com/EffiBench/EffiBench-X.git and https://huggingface.co/datasets/EffiBench/effibench-x.
Croissant File: json
Dataset URL: https://huggingface.co/datasets/EffiBench/effibench-x
Code URL: https://github.com/EffiBench/EffiBench-X.git
Primary Area: Datasets & Benchmarks for applications in language modeling and vision language modeling
Submission Number: 574
Loading