Keywords: Large Language Models, Software Engineering, Benchmarks, Benchmark Curation, SWE-bench, SWE-bench Verified, Static Analysis
TL;DR: Some SWE-bench tests are bound to fail because they enforce conditions that aren't specified in the corresponding issue descriptions. We propose a lightweight algorithm that identifies these 'unfair' tests with a similar degree of accuracy to LLMs.
Abstract: Software engineering benchmarks are useful tools for evaluating the programming abilities of large language models (LLMs). In addition to ranking models against each other, they can help us situate the current state of the art by leveraging real-world software engineering problems. For some benchmarks, this latter function is compromised by the presence of "unfair" tests, that is, tests that enforce requirements not specified in the corresponding issue descriptions. Unfortunately, manually identifying unfair tests is an expensive and time-consuming process; this is especially problematic for automated curation pipelines and continuously updated benchmarks. There are promising LLM-based solutions, but these come with the usual drawbacks: complex scaffolding, prompt sensitivity, a lack of reproducibility and environmental cost; in addition, their low recall means the majority of unfair tests are unlikely to be identified. As an alternative to both manual and LLM-based approaches, we propose a lightweight, fully deterministic heuristic for detecting unfair tests in software engineering benchmarks. We evaluate our heuristic against the human annotations used to curate SWE-bench Verified and compare the results to the corresponding evaluations of two LLM-based alternatives (aligning our methods to facilitate a direct comparison). We find that the accuracy of our heuristic exceeds that of all non-fine-tuned configurations of both alternatives, but does not exceed that of a fine-tuned configuration. Given the additional effort, complexity and environmental impact associated with fine-tuning, we consider this a positive result. We further propose a variant of our heuristic that is less precise but more sensitive, exceeding the recall of both the fine-tuned and the non-fine-tuned LLM-based alternatives.
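The abstract does not disclose the heuristic itself, so the following is only a hypothetical sketch of what a lightweight, deterministic unfair-test detector could look like: it flags an added test when the identifiers it exercises barely overlap with the issue description. The function names, the `min_overlap` threshold, and the lexical-overlap criterion are illustrative assumptions, not the paper's method.

```python
# Hypothetical sketch of a deterministic "unfair test" detector.
# Assumption: a test is suspicious when the identifiers it uses rarely
# appear in the issue description it is supposed to verify.
import re


def extract_identifiers(source: str) -> set[str]:
    """Collect word-like tokens (names, attributes, literals) from text or code."""
    return set(re.findall(r"[A-Za-z_][A-Za-z0-9_]*", source))


def is_potentially_unfair(issue_text: str, test_source: str,
                          min_overlap: float = 0.2) -> bool:
    """Flag a test whose identifiers overlap too little with the issue text.

    `min_overlap` is an illustrative threshold, not a value from the paper.
    """
    issue_tokens = {t.lower() for t in extract_identifiers(issue_text)}
    test_tokens = {t.lower() for t in extract_identifiers(test_source)}
    if not test_tokens:
        return False
    overlap = len(test_tokens & issue_tokens) / len(test_tokens)
    return overlap < min_overlap


if __name__ == "__main__":
    issue = "Calling parse_date('') raises TypeError instead of ValueError."
    test = "def test_parse_date_empty():\n    assert parse_date('') is None"
    print(is_potentially_unfair(issue, test))  # True: the issue never asks for None
```

Because the check is purely lexical it is deterministic and reproducible, the properties the abstract emphasises; the actual heuristic may well rely on richer static analysis of the gold tests.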
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 21651