Abstract: With the rise of Language Models (LMs) and Large Language Models (LLMs), their potential for code editing (CE) has gained attention. One approach has LLMs generate draft code modifications, which smaller LMs then refine and apply in a subsequent Code Editing Apply (CEA) step. However, CEA is error-prone, and existing benchmarks do not systematically evaluate how well LLMs handle these errors. We introduce FuseApplyBench, a benchmark designed to evaluate LLM performance across three major error types in CEA tasks. Building on FuseApplyBench's pipeline, we collect datasets for fine-tuning a model, denoted FuseApply, that improves the reliability of applied code modifications. We benchmark FuseApply, four widely used open-source LLMs, and Kortix-FastApply on FuseApplyBench. The results show that FuseApply significantly improves trustworthiness and accuracy metrics, while the other models perform notably worse, highlighting opportunities for advancing LLMs in CE.