Keywords: language models, benchmark, program synthesis
TL;DR: We propose a new benchmark that tasks LMs with writing efficient code for widely used math, science, and computer science functions.
Abstract: Despite progress in language model (LM) capabilities, evaluations have thus far focused on models' performance on tasks that humans have previously solved, such as programming (SWE-Bench) and mathematics (FrontierMath).
We therefore propose testing models' ability to design and implement algorithms in an open-ended benchmark: we task LMs with writing code that efficiently solves computationally challenging problems in computer science, physics, and mathematics.
Our AlgoTune benchmark consists of 120 tasks collected from domain experts and a framework for validating and timing LM-synthesized solution code, which is compared to reference implementations from popular open-source packages.
In addition, we develop a baseline LM agent, AlgoTuner, and evaluate its performance across a suite of frontier models.
AlgoTuner achieves an average 1.58x speedup against reference solvers, including methods from packages such as SciPy, scikit-learn, and CVXPY.
However, we find that current models fail to discover algorithmic innovations, instead preferring surface-level optimizations. We hope that AlgoTune catalyzes the development of LM agents exhibiting creative problem solving beyond state-of-the-art human performance.
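To make the evaluation setup concrete, below is a minimal sketch of the kind of comparison such a framework performs: a candidate solver is validated against a reference implementation from an open-source package (here SciPy) and then timed, with the speedup reported as the ratio of the reference runtime to the candidate runtime. The task, the function names, and the Cholesky-based optimization are illustrative assumptions, not the benchmark's actual tasks or harness.

import time

import numpy as np
import scipy.linalg


def reference_solve(A, b):
    # Reference solver from a popular open-source package (SciPy's general linear solver).
    return scipy.linalg.solve(A, b)


def candidate_solve(A, b):
    # Hypothetical LM-written candidate: exploits that A is symmetric positive definite
    # by taking a Cholesky-based path (an illustrative surface-level optimization).
    c, low = scipy.linalg.cho_factor(A)
    return scipy.linalg.cho_solve((c, low), b)


def time_solver(solver, A, b, repeats=10):
    # Best-of-N wall-clock timing; a real harness would also enforce resource limits.
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        solver(A, b)
        best = min(best, time.perf_counter() - start)
    return best


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    M = rng.standard_normal((500, 500))
    A = M @ M.T + 500 * np.eye(500)   # symmetric positive definite problem instance
    b = rng.standard_normal(500)

    # Accept the candidate only if its output matches the reference solution.
    assert np.allclose(reference_solve(A, b), candidate_solve(A, b), atol=1e-6)

    t_ref = time_solver(reference_solve, A, b)
    t_cand = time_solver(candidate_solve, A, b)
    print(f"speedup = {t_ref / t_cand:.2f}x")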
Code URL: https://anonymous.4open.science/r/AlgoTuneCode-D1F2
Supplementary Material: pdf
Primary Area: Datasets & Benchmarks for applications in language modeling and vision language modeling
Submission Number: 1027