Environment-Free Coding Benchmarks: Evaluating Language Model Coding Capabilities without a Dedicated Environment

Published: 07 Jul 2025 · Last Modified: 07 Jul 2025 · KnowFM @ ACL 2025 · CC BY 4.0
Keywords: benchmark, code generation, language model
TL;DR: We introduce the EFCB benchmark, which contains a greater variety and number of questions for evaluating language models on different coding tasks
Abstract: The increasing adoption of language models for coding tasks has prompted researchers to develop coding benchmarks to better assess and quantify a language model's coding abilities across a variety of tasks. Existing benchmarks effectively evaluate model code generation and understanding abilities, but they typically require an external environment to verify code, which can slow down and complicate model evaluation. This paper presents the Environment-Free Coding Benchmarks (EFCB) suite, a collection of 5,512 questions drawn from real-world GitHub pull requests, which offers several advantages over existing coding benchmarks: it eliminates the need for an external coding environment, provides a larger and more diverse question bank spanning different programming languages and industry use cases, and comprises a multi-faceted collection of tasks that evaluate different indicators of model coding ability. Evaluating o4-mini and Llama-3.3-70B as state-of-the-art (SOTA) models on EFCB, we observe that current SOTA models achieve approximately uniform performance across different programming languages and use cases, and we identify areas for improvement in existing SOTA models, given that current EFCB results have not yet reached benchmark saturation.
Archival Status: Non-archival (not included in proceedings)
Submission Number: 64