SWE-Refactor: A Repository-Aware Benchmark for Evaluating LLMs on Real-World Code Refactoring

ICLR 2026 Conference Submission 13782 Authors

18 Sept 2025 (modified: 08 Oct 2025) | ICLR 2026 Conference Submission | CC BY 4.0
Keywords: Code Refactoring, Large Language Models, Software Engineering, Java Projects
TL;DR: SWE-Refactor is a new benchmark addressing key limitations of existing refactoring datasets, enabling robust evaluation of LLMs on atomic and compound refactorings with automated correctness checks.
Abstract: Recent advances in Large Language Models (LLMs) have garnered significant attention for their applications in software engineering tasks. Among these tasks, code refactoring poses unique challenges: unlike code generation, refactoring requires precise changes that preserve program behavior while improving structure, which makes automated evaluation difficult. Existing refactoring benchmarks suffer from three key limitations: (1) they often focus on atomic refactoring types while missing more complex ones; (2) they contain noisy data with entangled, unrelated code changes, making it difficult to accurately assess LLMs' true refactoring capabilities; and (3) they lack repository and structural information needed to support realistic evaluation. To address these issues, we propose SWE-Refactor, a new benchmark for LLM-based code refactoring. SWE-Refactor contains 1,099 pure, real-world refactorings collected from 18 Java projects. Each refactoring instance is verified through compilation, test execution, and automated refactoring detection tools to ensure correctness. Unlike prior benchmarks, SWE-Refactor covers both atomic and compound refactoring types (single and multiple code changes). It includes rich repository-level data (e.g., method callers and callees, class hierarchies) as well as configuration details such as test coverage and build settings. We evaluate nine widely used LLMs on SWE-Refactor, including GPT-4o-mini, DeepSeek-V3, and CodeLLaMa. DeepSeek-V3 achieves the best performance with 457 successful refactorings (41.58%), followed by GPT-4o-mini with 438 (39.85%). DeepSeek-V3 performs particularly well on Extract Method, completing 301 cases, while GPT-4o-mini is stronger on more complex refactoring types such as Move Method and Extract and Move Method. Furthermore, we find that adding retrieval context via few-shot examples and using a multi-agent workflow significantly improve performance, with the multi-agent approach achieving the highest success rate. We release SWE-Refactor and all evaluation results to support future research on LLM-based code refactoring.
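To make the task concrete, the sketch below illustrates the kind of atomic refactoring (here, Extract Method) that the benchmark targets: the code must change structurally while its observable behavior is preserved. The class and field names are invented for illustration and are not drawn from SWE-Refactor.

```java
// Hypothetical illustration of an atomic "Extract Method" refactoring.
// All identifiers are invented; this is not an instance from the benchmark.

record Invoice(String id, long amount) {}

// Before: print() mixes validation with output formatting.
class InvoicePrinterBefore {
    void print(Invoice invoice) {
        if (invoice.amount() < 0) {
            throw new IllegalArgumentException("negative amount: " + invoice.id());
        }
        System.out.println("Invoice " + invoice.id() + " = " + invoice.amount());
    }
}

// After: the validation block is extracted into its own method.
// Observable behavior is unchanged; only the structure improves.
class InvoicePrinterAfter {
    void print(Invoice invoice) {
        validate(invoice);
        System.out.println("Invoice " + invoice.id() + " = " + invoice.amount());
    }

    private void validate(Invoice invoice) {
        if (invoice.amount() < 0) {
            throw new IllegalArgumentException("negative amount: " + invoice.id());
        }
    }
}
```

A compound refactoring such as Extract and Move Method would additionally relocate the extracted validate method to another class (for example, a validator), which is why repository-level context such as callers, callees, and class hierarchies matters for evaluation.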
Primary Area: datasets and benchmarks
Submission Number: 13782