CodeEditorBench: Evaluating Code Editing Capability of LLMs

Published: 06 Mar 2025, Last Modified: 19 Apr 2025, DL4C @ ICLR 2025, CC BY 4.0
Track: long paper (up to 9 pages)
Keywords: Large Language Model, Benchmark, Code Editing
TL;DR: This work introduces CodeEditorBench, a pioneering evaluation framework designed to rigorously assess the performance of LLMs in code editing.
Abstract: Large Language Models (LLMs) for code are rapidly evolving, with code editing emerging as a critical capability. We introduce CodeEditorBench, a pioneering evaluation framework designed to rigorously assess the performance of LLMs in code editing tasks, including debugging, translating, polishing, and requirement switching. Unlike existing benchmarks that focus solely on code generation, CodeEditorBench emphasizes real-world scenarios and practical aspects of software development. We curated diverse coding challenges and scenarios from five sources, covering various programming languages, complexity levels, and editing tasks. Evaluating 19 LLMs revealed that, despite the relative consistency between models' code editing and code generation abilities, notable differences persist. The results highlight the models' limitations in code polishing and requirement-based code rewriting, and also indicate that models specifically tailored for code feedback show significant improvements in code editing tasks. CodeEditorBench aims to catalyze advancements in LLMs by providing a robust platform for assessing code editing capabilities. We will release the dataset and evaluation code to enable the community to study the code editing capabilities of LLMs.
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Submission Number: 65