Auto-SWE-Bench: A Framework for the Scalable Generation of Software Engineering Benchmarks from Open-Source Repositories

ICLR 2026 Conference Submission 23143 Authors

20 Sept 2025 (modified: 26 Jan 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Benchmark, SWE Bench, Evaluation of Large Language Models, Coding Benchmarks, Coding Agent
Abstract: Benchmarks like SWE-bench have shaped the evaluation of Large Language Models (LLMs) on complex software engineering tasks. However, these efforts remain limited by manual curation, static datasets, and a focus on Python-based bug fixes. We introduce SWE-Bench Atlas, a fully automated framework for generating high-fidelity, large-scale, multilingual, and diverse real-world repository-level coding tasks from open-source GitHub projects. Unlike synthetic frameworks or manually curated sets, SWE-Bench Atlas introduces an end-to-end pipeline that continuously harvests live pull requests to capture a broad spectrum of real-world software engineering demands, including both bug fixes and feature requests. The framework operates via a five-stage automated pipeline: (1) a Sourcing Module that identifies high-quality pull requests across diverse languages; (2) a Neuro-Symbolic Dockerization System that uses tool-augmented and template-guided synthesis to enforce strict reproducibility; (3) a State-Differential Test Oracle Extraction stage that integrates Adaptive Log Parsing to verify both regressions and feature requests across heterogeneous build systems; (4) an Automated Quality Assurance stage that ensures environmental determinism; and (5) a Hint-Guided Trajectory Synthesis module that converts model-breaking instances into high-value training data. Our initial benchmark comprises 11,133 instances from 3,971 repositories across 11 languages. On a 1,782-instance subset of this benchmark, today's strongest models perform as follows: claude-sonnet-4.5 (36.20% pass@10), gpt-5-2025-08-07 (34.57%), gemini/gemini-2.5-pro (16.89%), and gpt-4o (18.24%). A public release of a 500-task dataset, together with evaluation scripts, is included in the supplementary material. We further demonstrate the utility of our dataset by showing that fine-tuning on SWE-Bench Atlas instances yields measurable improvements on the SWE-bench Multilingual benchmark. By automatically producing dynamic, polyglot, and verifiable tasks, SWE-Bench Atlas enables scalable evaluation and advancement of the coding and reasoning abilities of next-generation AI systems.
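The abstract does not spell out the mechanics of stage (3), so the sketch below is a rough illustration only of the generic "state-differential" idea behind fail-to-pass test oracles: run the test suite before and after the gold patch and keep the tests whose status flips from failing to passing. All names here (passing_tests, fail_to_pass_oracle, gold.patch, the pytest-style log format) are hypothetical stand-ins, not the paper's implementation, and real heterogeneous build systems would need the adaptive log parsing the pipeline automates.

```python
import subprocess

def passing_tests(repo_dir: str, test_cmd: list[str]) -> set[str]:
    # Run the repository's test suite and collect the IDs of passing tests.
    # Assumes a pytest-style summary (e.g. `pytest -rA`) whose lines look like
    # "PASSED tests/test_x.py::test_y"; other ecosystems need their own parsers.
    proc = subprocess.run(test_cmd, cwd=repo_dir, capture_output=True, text=True)
    return {
        line.split(maxsplit=1)[1]
        for line in proc.stdout.splitlines()
        if line.startswith("PASSED ")
    }

def fail_to_pass_oracle(repo_dir: str, test_cmd: list[str]) -> list[str]:
    # Snapshot test outcomes before the gold patch, apply the patch, re-run,
    # and keep the tests that flip from failing to passing: these become the
    # oracle used to verify a candidate fix or feature implementation.
    before = passing_tests(repo_dir, test_cmd)
    subprocess.run(["git", "apply", "gold.patch"], cwd=repo_dir, check=True)
    after = passing_tests(repo_dir, test_cmd)
    return sorted(after - before)

if __name__ == "__main__":
    # Hypothetical usage: the repository is checked out at the pre-patch commit
    # and gold.patch (the merged PR's diff) sits in the repository root.
    print(fail_to_pass_oracle("path/to/repo", ["pytest", "-rA"]))
```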
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 23143