Keywords: LLM, Code Benchmark
Abstract: The issue-resolving task, where a model generates patches to fix real-world bugs, has emerged as a key benchmark for evaluating the capabilities of large language models (LLMs). While SWE-bench has become the dominant benchmark in this domain, it suffers from several limitations: it has not been updated since its release, is restricted to only 12 repositories, and relies heavily on manual effort for constructing test instances and setting up executable environments, significantly limiting its scalability. We present SWE-bench-Live, a live-updatable benchmark designed to address these limitations. SWE-bench-Live currently includes 1,890 tasks derived from real GitHub issues created since 2024, spanning 223 repositories. Each task is accompanied by a dedicated Docker image to ensure reproducible execution. Additionally, we introduce an automated curation pipeline that streamlines the entire process from instance creation to environment setup, removing manual bottlenecks and enabling scalability and continuous updates. We evaluate a range of state-of-the-art models and agent frameworks on SWE-bench-Live, offering detailed empirical insights into their real-world bug-fixing capabilities. By providing a fresh, diverse, and executable benchmark grounded in live repository activity, SWE-bench-Live supports reliable, large-scale assessment of code LLMs and code agents in realistic development settings.
Dataset URL: https://huggingface.co/datasets/SWE-bench-Live/SWE-bench-Live
Code URL: https://github.com/microsoft/SWE-bench-Live
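For readers who want to inspect the task instances, below is a minimal sketch of loading the dataset from the Hugging Face URL above with the `datasets` library. Only the repository identifier comes from this page; the split names and per-task fields are whatever the dataset card defines and are not confirmed by the abstract.

```python
# Minimal sketch: browse SWE-bench-Live task instances from the Hugging Face Hub.
# Assumption: split names and instance fields follow the dataset card; only the
# dataset identifier below is taken from this submission page.
from datasets import load_dataset

ds = load_dataset("SWE-bench-Live/SWE-bench-Live")
print(ds)  # lists the available splits and their sizes

# Peek at one instance from the first split; SWE-bench-style datasets typically
# include a repository identifier, the issue text, and a reference patch.
first_split = next(iter(ds.values()))
print(first_split[0].keys())
```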
Primary Area: Datasets & Benchmarks for applications in language modeling and vision language modeling
Submission Number: 1636