Keywords: Large Language Models, Software Engineering, Benchmarks, Benchmark Curation, SWE-bench, SWE-bench Verified, Static Analysis
TL;DR: Some SWE-bench tests are bound to fail because they enforce conditions that aren't specified in the corresponding issue descriptions. We propose a lightweight algorithm that identifies these 'unfair' tests with a similar degree of accuracy to LLMs.
Abstract: Software engineering benchmarks are useful tools for evaluating the programming abilities of large language models (LLMs). In addition to ranking models against each other, they can help us situate the current state of the art by leveraging real-world software engineering problems. For some benchmarks, this latter function is compromised by the presence of "unfair" tests, that is, tests that enforce requirements not specified in the corresponding issue descriptions. Unfortunately, manually identifying unfair tests is an expensive and time-consuming process; this is especially problematic for automated curation pipelines and continuously updated benchmarks. There are promising LLM-based solutions, but these come with the usual drawbacks: complex scaffolding, prompt sensitivity, a lack of reproducibility and environmental cost; in addition, their low recall means the majority of unfair tests are unlikely to be identified. As an alternative to both manual and LLM-based approaches, we propose a lightweight, fully deterministic heuristic for detecting unfair tests in software engineering benchmarks. We evaluate our heuristic against the human annotations used to curate SWE-bench Verified and compare the results to the corresponding evaluations of two LLM-based alternatives (aligning our methods to facilitate a direct comparison). We find that the accuracy of our heuristic exceeds that of all non-fine-tuned configurations of both alternatives, but does not exceed that of a fine-tuned configuration. Given the additional effort, complexity and environmental impact associated with fine-tuning, we consider this a positive result. We further propose a variant of our heuristic that is less precise but more sensitive, exceeding the recall of both the fine-tuned and the non-fine-tuned LLM-based alternatives.
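The abstract does not disclose the heuristic itself, so the following is only a hypothetical sketch of what a lightweight, deterministic unfair-test detector could look like: it flags an added test when the identifiers it exercises barely overlap with the issue description. The function names, the `min_overlap` threshold, and the lexical-overlap criterion are illustrative assumptions, not the paper's method.

```python
# Hypothetical sketch of a deterministic "unfair test" detector.
# Assumption: a test is suspicious when the identifiers it uses rarely
# appear in the issue description it is supposed to verify.
import re


def extract_identifiers(source: str) -> set[str]:
    """Collect word-like tokens (names, attributes, literals) from text or code."""
    return set(re.findall(r"[A-Za-z_][A-Za-z0-9_]*", source))


def is_potentially_unfair(issue_text: str, test_source: str,
                          min_overlap: float = 0.2) -> bool:
    """Flag a test whose identifiers overlap too little with the issue text.

    `min_overlap` is an illustrative threshold, not a value from the paper.
    """
    issue_tokens = {t.lower() for t in extract_identifiers(issue_text)}
    test_tokens = {t.lower() for t in extract_identifiers(test_source)}
    if not test_tokens:
        return False
    overlap = len(test_tokens & issue_tokens) / len(test_tokens)
    return overlap < min_overlap


if __name__ == "__main__":
    issue = "Calling parse_date('') raises TypeError instead of ValueError."
    test = "def test_parse_date_empty():\n    assert parse_date('') is None"
    print(is_potentially_unfair(issue, test))  # True: the issue never asks for None
```

Because the check is purely lexical it is deterministic and reproducible, the properties the abstract emphasises; the actual heuristic may well rely on richer static analysis of the gold tests.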
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 21651