Keywords: Benchmark Quality, LLM Evaluation
TL;DR: This paper presents a comprehensive empirical analysis of the quality of the SWE-Bench dataset, uncovering critical issues that inflate the reported performance of LLM-based approaches for solving software-related problems.
Abstract: Large Language Models (LLMs) can offer assistance with coding tasks in Software Engineering (SE). To facilitate a rigorous evaluation of LLMs in practical coding contexts, Jimenez et al. introduced the SWE-Bench dataset, which comprises 2,294 real-world GitHub issues.
The SWE-Bench dataset has quickly become the most popular benchmark for evaluating LLMs in software engineering. It has been adopted by leading companies such as OpenAI, Anthropic, Google, and Meta to assess the coding capabilities of their models. Despite its central role in measuring the state-of-the-art performance of newly released LLMs, a systematic evaluation of the quality of SWE-Bench is still lacking.
In this paper, we address this gap by presenting an empirical analysis of two variants of the SWE-Bench dataset (i.e., SWE-Bench Lite and SWE-Bench Verified), both of which have been validated by developers to ensure quality. We conducted a manual screening of the instances that the top three models on the SWE-Bench leaderboard (i.e., SWE-AGENT-v1.0, OPENHANDS + CODEACT-v2.1, and AUTOCODEROVER-v2.0) successfully resolved, comparing the model-generated patches with the actual pull requests.
Our analysis reveals some critical issues with the SWE-Bench dataset: 1) 60.83% of the successfully resolved issues involve "solution leakage", where the solution was either directly provided or indirectly hinted at in the issue report or its comments; 2) 47.93% of the resolved issues were incorrectly marked as resolved because the patches passed weak test cases, i.e., tests that were not sufficient to verify patch correctness; we refer to these insufficiently verified patches as "plausible patches". When we filtered out these problematic issues, the resolution rate of the three agents dropped, on average, from 42.1% to 21.8% on SWE-Bench Lite and from 51.7% to 25.9% on SWE-Bench Verified.
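To make the solution-leakage criterion concrete, the sketch below shows one simple heuristic: flagging an instance when substantive lines added by the gold patch already appear verbatim in the issue report or its comments. The function names and the textual-overlap rule are illustrative assumptions, not the manual screening procedure used in the paper.

```python
# Hypothetical sketch: flag potential solution leakage by checking whether
# substantive lines added in the gold patch already appear verbatim in the
# issue report or its comments. Names (issue_text, gold_patch) and the
# overlap rule are illustrative assumptions, not the paper's procedure.

def added_lines(gold_patch: str) -> list[str]:
    """Collect non-trivial lines added by a unified diff."""
    lines = []
    for line in gold_patch.splitlines():
        if line.startswith("+") and not line.startswith("+++"):
            content = line[1:].strip()
            if len(content) > 10:  # skip blank lines and trivial edits
                lines.append(content)
    return lines


def has_solution_leak(issue_text: str, gold_patch: str) -> bool:
    """Return True if any substantive added line of the fix is quoted in the issue."""
    return any(line in issue_text for line in added_lines(gold_patch))
```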
The critical issues in the current SWE-Bench dataset motivated us to create a more rigorous evaluation benchmark, SWE-Bench+, by addressing solution-leak risks and enhancing the test suites used for patch validation. Specifically, we introduce SolutionLeakDetector, an LLM-based tool that filters out issues with solution leaks, and TestEnhancer, an LLM-based approach that strengthens test suites to mitigate the weak-test problem. SolutionLeakDetector achieves 80.45% accuracy in solution-leak detection. TestEnhancer improves patch validation and identifies plausible patches for 97.11% of the weak-test issues, leading to average resolution-rate drops of 27.00 percentage points on SWE-Bench Lite and 36.27 percentage points on SWE-Bench Verified. Although we focus on SWE-Bench, our approach can be readily extended to other software engineering benchmarks to support their evolution.
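As a rough illustration of how an LLM-based filter in the spirit of SolutionLeakDetector could be wired up, the sketch below prompts a chat model to label an instance as leaking or not. The prompt wording, the model name, and the use of the OpenAI client are assumptions made for illustration; the paper's actual tool may differ.

```python
# Minimal sketch of an LLM-based leak filter in the spirit of SolutionLeakDetector.
# The prompt wording, the model name, and the OpenAI client usage are assumptions
# made for illustration; they are not the paper's actual implementation.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPT = (
    "You are auditing a software engineering benchmark. Given a GitHub issue "
    "(report plus comments) and the gold patch that fixed it, answer LEAK if the "
    "issue text directly provides or strongly hints at the patch, otherwise NO_LEAK.\n\n"
    "Issue:\n{issue}\n\nGold patch:\n{patch}\n\nAnswer:"
)


def classify_leak(issue_text: str, gold_patch: str, model: str = "gpt-4o") -> bool:
    """Return True when the model judges the issue to leak its own solution."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT.format(issue=issue_text, patch=gold_patch)}],
        temperature=0,
    )
    return response.choices[0].message.content.strip().upper().startswith("LEAK")
```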
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 14250