A Taxonomy of Failing Bug Reproduction Tests on SWT-Bench

13 Feb 2026 (modified: 02 Apr 2026) · Submitted to AIware 2026 · CC BY-SA 4.0
Keywords: Taxonomy, Bug Reproduction Testing, SWT-Bench, LLM4SE, Test Generation
TL;DR: Our taxonomy of bug reproduction test framework failures demonstrates that tool selection affects downstream precision and coverage, whereas model selection influences the quality and quantity of generated tests.
Abstract: Automatic Program Repair (APR) tools require high-quality test generation models to validate candidate patches. Although contemporary automated bug reproduction testing tools resolve a growing number of issues, they continue to struggle with more complex ones. These struggles may be caused, in part, by a framework’s general lack of problem-specific context. Furthermore, the reasons why these frameworks fail to resolve issues have never been categorized. This paper addresses that gap by providing a taxonomy of bug reproduction test failures. The study uses the AssertFlip and OpenHands frameworks, the two highest-scoring open-source tools on the SWT-Bench Verified leaderboard. To explore how model selection affects the causes of failure, both tools were run with GPT-5-mini and GPT-4o-mini. The error traces of these frameworks were then categorized to determine granular modes of failure. The most prominent causes of failure were mechanical and implementation failures. Overall, a tool’s generation approach has a significant impact on precision and coverage, while the choice of model dramatically alters the quality and quantity of generated tests. Specifically, more powerful models with higher reasoning capability tended to exhibit fewer failures caused by a lack of technical or problem-specific context. Our experiment also provides evidence that the taxonomy can serve as concrete guidance for augmenting issue descriptions to improve overall generation quality.
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public.
Paper Type: Full-length papers (i.e. case studies, theoretical, applied research papers). 8 pages
Submission Number: 36