Keywords: Software Testing, Open-source LLMs, Fine-tuning for SWE tasks, Datasets, Benchmark
TL;DR: A novel pipeline for training open-source LLMs for generating bug reproduction tests.
Abstract: Software testing is crucial for ensuring the correctness and reliability of software systems. Automated generation of issue reproduction tests from natural language issue descriptions enhances developer productivity by simplifying root cause analysis, promotes test-driven development (TDD) - "test first, write code later" - and can improve the effectiveness of automated issue resolution systems such as coding agents. Existing methods for this task predominantly rely on closed-source LLMs (e.g., GPT-5, Claude Sonnet), with limited exploration of open-source models, likely due to their weaker performance. To address this, we propose **SWE-Tester** - a novel pipeline for training open-source LLMs to generate issue reproduction tests. First, we curate a high-quality training dataset of **41K** instances from **2.6K** open-source GitHub repositories and use it to train LLMs of varying sizes and families. The fine-tuned models achieve absolute improvements of up to **10\%** in success rate and **21\%** in change coverage on SWT-Bench Verified. Further analysis shows consistent improvements with increased inference-time compute, more data, and larger models. These results highlight the effectiveness of our framework for advancing open-source LLMs in this domain.
Primary Area: other topics in machine learning (i.e., none of the above)
Submission Number: 20462