GSO: Challenging Software Optimization Tasks for Evaluating SWE-Agents

Published: 18 Sept 2025, Last Modified: 30 Oct 2025 · NeurIPS 2025 Datasets and Benchmarks Track poster · CC BY 4.0
Keywords: SWE-Bench, SWE Agents, Code Optimization
TL;DR: GSO: SWE Agents Struggle at Reasoning and Engineering for Software Optimization
Abstract: Developing high-performance software is a complex task that requires specialized expertise. We introduce GSO, a benchmark for evaluating language models' capabilities in developing high-performance software. We develop an automated pipeline that analyzes repository commit histories and generates and executes performance tests to identify 102 challenging optimization tasks across 10 codebases, spanning diverse domains and programming languages. An agent is provided with a codebase and a performance test as a precise specification, and is tasked with improving runtime efficiency, which is measured against the expert developer's optimization. Our quantitative evaluation reveals that leading SWE-Agents struggle significantly, achieving less than a 5% success rate, with limited improvements even with inference-time scaling. Our qualitative analysis identifies key failure modes, including difficulties with low-level languages, reliance on lazy optimization strategies, and challenges in accurately localizing bottlenecks. We release the code and artifacts of our benchmark along with agent trajectories to enable future research.
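The abstract describes scoring an agent's patch against the expert developer's optimization on the same performance test. The snippet below is a minimal sketch of how such a relative-speedup score could be computed; the function names, the use of median runtimes, and the ratio-of-speedups formulation are assumptions for illustration, not the benchmark's exact metric.

```python
import statistics

def speedup(baseline_times, patched_times):
    """Speedup of a patch: median baseline runtime over median patched runtime."""
    return statistics.median(baseline_times) / statistics.median(patched_times)

def opt_score(baseline_times, agent_times, expert_times):
    """Fraction of the expert developer's speedup recovered by the agent's patch.

    A success criterion could, for example, require this ratio to exceed some
    threshold; the actual thresholds used by GSO may differ.
    """
    return speedup(baseline_times, agent_times) / speedup(baseline_times, expert_times)

# Hypothetical runtimes (seconds) from repeated runs of the performance test.
baseline = [12.1, 12.3, 12.0]
agent    = [9.8, 9.9, 10.1]   # agent's patch: ~1.2x speedup
expert   = [4.0, 4.1, 3.9]    # expert commit: ~3.0x speedup
print(f"agent recovers {opt_score(baseline, agent, expert):.0%} of the expert speedup")
```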
Croissant File: json
Dataset URL: https://huggingface.co/datasets/gso-bench/gso
Code URL: https://github.com/gso-bench/gso
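The tasks can be browsed via the Hugging Face dataset linked above. A minimal sketch of loading it with the `datasets` library is shown below; the dataset ID comes from the URL, but the available splits and fields should be checked against the repository's documentation rather than assumed.

```python
from datasets import load_dataset

# Dataset ID taken from the Dataset URL above.
gso = load_dataset("gso-bench/gso")
print(gso)  # inspect the available splits and per-task fields
```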
Primary Area: Datasets & Benchmarks for applications in language modeling and vision language modeling
Submission Number: 2128