Keywords: Superoptimization, Benchmarking Agents, Evolutionary Algorithms, Repository-level code synthesis
TL;DR: FormulaCode is a continuously updating benchmark that complements SWE-Bench for evaluating optimization agents (like AlphaEvolve)
Track: Long Paper (up to 9 pages)
Abstract: Rapid advances in LLM agents have demonstrated the ability to optimize code against continuous objective functions, a significant leap beyond traditional code generation techniques. However, there is an urgent need for novel benchmarks that can effectively measure this capability and translate it into real-world impact. Current code benchmarks, which often rely on binary pass/fail outcomes, offer a limited evaluation framework that falls short of capturing the full potential of these emerging capabilities. To bridge this gap, we introduce FormulaCode, a novel benchmark designed for evaluating agentic superoptimization on large codebases, with a focus on real-world performance optimization. Constructed from a dataset of 451 real-world performance bottlenecks automatically mined from GitHub, FormulaCode enables comprehensive testing of an agent's ability to triage, diagnose, and resolve inefficiencies in realistic software environments. FormulaCode proves to be a challenging benchmark for frontier LLMs and agentic frameworks, with unrestricted repository exploration emerging as a principal factor in finding performance inefficiencies. By introducing FormulaCode, our goal is to drive the development of next-generation optimization algorithms that meet the rigorous demands of real-world software projects.
Format: We have read the camera-ready instructions, and our paper is formatted with the provided template.
Supplementary Material: pdf
De-Anonymization: This submission has been de-anonymized.
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Submission Number: 30