SWE-Bench+: Enhanced LLM Coding Benchmark

Haoran Xue; Gias Uddin; Song Wang

Published: 14 May 2026, Last Modified: 14 May 2026. Venue: AIWare 2026 Benchmark and Dataset. License: CC BY 4.0

Keywords: Benchmarking, Large Language Models, Issue Resolution

Abstract: SWE-Bench has become one of the most widely used benchmarks for evaluating whether large language models (LLMs) can resolve real-world GitHub issues. Despite its broad adoption, however, the quality of the benchmark itself has not been systematically evaluated. In this paper, we present SWE-Bench+, an enhanced benchmark framework that improves evaluation reliability by addressing two benchmark-quality risks in SWE-Bench: solution leakage in issue descriptions and weak tests that allow plausible but incorrect patches to pass. To construct SWE-Bench+, we first analyze 217 issues that were resolved by all three of the top-performing agents on the SWE-Bench leaderboard during our study (SWE-agent 1.0, OpenHands+CodeAct v2.1, and AutoCodeRover-v2.0), yielding 651 model-generated patches, and identify five quality-problem patterns under the two broader risks of solution leakage and weak tests. Based on this analysis, we develop two components: SoluLeakDetector, which identifies solution-leaking content for issue sanitization, and TestEnhancer, which strengthens test suites for more reliable patch validation. Our results show that 60.83% of the commonly resolved issues exhibit solution leakage, and 77.88% are problematic overall. After leaked information is removed from issue descriptions, model resolution rates drop substantially, showing that leakage materially inflates reported performance. SoluLeakDetector achieves 80.45% accuracy on solution-leak detection, and TestEnhancer identifies plausible patches for 97.11% of weak-test issues while reducing average resolution rates by 27.00 percentage points on Lite and 36.27 percentage points on Verified. These findings show that existing SWE-Bench evaluations can substantially overestimate true issue-resolution performance and that SWE-Bench+ provides a more rigorous benchmark framework for LLM-based software engineering agents.
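As a quick sanity check on the figures above, the following minimal sketch recomputes the patch total and the issue counts implied by the two reported percentages. The abstract states only the percentages, so the counts are inferred by rounding and are an assumption; the helper name `implied_count` is hypothetical.

```python
# Figures quoted in the abstract.
COMMON_ISSUES = 217      # issues resolved by all three agents
AGENTS = 3               # SWE-agent 1.0, OpenHands+CodeAct v2.1, AutoCodeRover-v2.0
LEAK_RATE = 60.83        # percent of these issues with solution leakage
PROBLEM_RATE = 77.88     # percent of these issues that are problematic overall


def implied_count(percent: float, total: int) -> int:
    """Nearest whole number of issues that a reported percentage corresponds to.

    This is an inference from the rounded percentage, not a number
    taken from the paper.
    """
    return round(percent / 100 * total)


# One patch per agent per issue.
patches = COMMON_ISSUES * AGENTS

leak_issues = implied_count(LEAK_RATE, COMMON_ISSUES)
problem_issues = implied_count(PROBLEM_RATE, COMMON_ISSUES)

print("patches analysed:", patches)
print("implied issues with solution leakage:", leak_issues)
print("implied problematic issues overall:", problem_issues)
```

Converting the inferred counts back to percentages and rounding to two decimals reproduces the reported 60.83% and 77.88%, so the figures in the abstract are internally consistent under this reading.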

Submission Number: 24