SWE-Mutation: Can LLMs Generate Reliable Test Suites in Software Engineering?

ACL ARR 2026 January Submission7078 Authors

06 Jan 2026 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · CC BY 4.0
Keywords: evaluation of code models, software engineering automation
Abstract: Evaluating software engineering capabilities has become a core part of developing modern language models (LMs); however, the key bottleneck to further scaling is not the scarcity of high-quality solutions but the lack of high-quality test suites. Test suites are indispensable both for synthesizing program repair trajectories and for providing precise feedback signals in reinforcement learning. Unfortunately, due to the high cost and difficulty of annotation, high-quality test suites have long been hard to obtain, while those automatically generated by LMs tend to be superficial and lack sufficient discriminative power. As a first step toward constructing high-quality test suites, we introduce SWE-Mutation, a benchmark designed to evaluate the quality of test suites generated by LMs. The benchmark characterizes test suite discriminability by introducing systematically mutated solutions that attempt to "fool" the test suites and pass validation. We further propose an agentic, language-agnostic framework for automatically generating complex mutants. Our benchmark consists of 2,636 mutated variants derived from 800 original instances and includes a multilingual subset spanning nine programming languages. Experiments on seven LMs reveal that even DeepSeek-V3.1 achieves only a 10.20% verification rate and a 36.15% detection rate, highlighting the inadequacy of current LMs. Additionally, our agentic mutation strategy enhances realism, reducing the average detection rate from 71.04% to 39.81% compared with conventional methods.
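As a rough illustration of the detection-rate metric the abstract reports, the idea is to count the fraction of mutated solutions that the test suite rejects. The sketch below uses hypothetical names (`Mutant`, `passes_suite`, `detection_rate`) that are not from the paper; the paper's actual scoring pipeline may differ.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Mutant:
    """A mutated solution; passes_suite() is True if it slips past the suite."""
    instance_id: str
    passes_suite: Callable[[], bool]


def detection_rate(mutants: List[Mutant]) -> float:
    """Fraction of mutants the test suite detects (i.e., fails to pass)."""
    if not mutants:
        return 0.0
    detected = sum(1 for m in mutants if not m.passes_suite())
    return detected / len(mutants)


# Toy example: the suite catches 3 of 4 mutants.
mutants = [
    Mutant("i1", lambda: False),  # detected by the suite
    Mutant("i2", lambda: False),  # detected by the suite
    Mutant("i3", lambda: True),   # escapes the suite ("fools" it)
    Mutant("i4", lambda: False),  # detected by the suite
]
print(detection_rate(mutants))  # -> 0.75
```

Under this reading, a harder-to-detect mutant set lowers the detection rate, which is why the agentic mutation strategy reducing the average rate from 71.04% to 39.81% indicates more realistic mutants.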
Paper Type: Long
Research Area: Code Models
Research Area Keywords: evaluation of code models, software engineering automation
Contribution Types: Data resources
Languages Studied: English
Submission Number: 7078