The Capability Frontier: Benchmarks Miss 82% of Model Performance

Published: 08 Mar 2026, Last Modified: 17 Apr 2026 · ICLR 2026 Workshop LLM Reasoning · CC BY 4.0
Track: long paper (up to 10 pages)
Keywords: Benchmarking, Routing, Routers, LLMs, Judging, Judges, Pareto Frontier, Probabilistic Graphical Models
TL;DR: This paper introduces methods for, and quantifies, the quality and cost improvements that can be attained on widely used benchmarks by leveraging the diversity of responses both within and across LLMs.
Abstract: Existing benchmarks typically report accuracy for a single model on a single run. This systematically understates real-world LLM capabilities: (i) different models get different questions correct, allowing for ensembling gains, and (ii) given a budget, some models can be run multiple times to improve results. We introduce the concept of a Capability Frontier: a Pareto frontier for a set of models, characterizing the best achievable performance at each cost level. Our construction of the Capability Frontier corrects for two biases: underestimation from evaluating a single model on a single run, and overestimation from taking the maximum over several noisy models or runs. To understand the impact of these corrections, we study 21 LLMs across 16 widely used benchmarks (coding, reasoning, medicine, factuality, instruction following, and agentic tasks) and compare the performance of the Capability Frontier at matched cost to each benchmark's top-performing model. Correcting for single-model evaluation yields a 54% average accuracy improvement (measured as reduction in error rate); additionally correcting for single runs yields an 82% improvement. Moreover, SOTA accuracy can be matched at an 85% cost reduction on the Capability Frontier. These findings suggest that collective LLM capabilities are substantially underestimated, with immediate implications for both evaluation and deployment.
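To make the frontier construction concrete, below is a minimal sketch of building a Pareto frontier from per-model (cost, accuracy) points. The model names, costs, and accuracies are hypothetical placeholders, and the sketch omits the paper's corrections for run-to-run noise and multi-run ensembling; it only illustrates the "best achievable accuracy at each cost level" idea.

```python
# Minimal sketch: Pareto (capability) frontier over hypothetical
# single-run (cost, accuracy) points. Not the paper's full method,
# which also corrects for noise and allows repeated runs per model.

def pareto_frontier(points):
    """Return (cost, accuracy) points not dominated by any other point.

    A point is dominated if another point achieves at least the same
    accuracy at no greater cost.
    """
    frontier = []
    best_acc = float("-inf")
    # Sweep in order of increasing cost (ties broken by higher accuracy),
    # keeping only points that improve on the best accuracy seen so far.
    for cost, acc in sorted(points, key=lambda p: (p[0], -p[1])):
        if acc > best_acc:
            frontier.append((cost, acc))
            best_acc = acc
    return frontier


if __name__ == "__main__":
    # Hypothetical per-query cost (USD) and benchmark accuracy per model.
    models = {
        "model_a": (0.002, 0.71),
        "model_b": (0.010, 0.78),
        "model_c": (0.015, 0.76),  # dominated by model_b
        "model_d": (0.060, 0.84),
    }
    for cost, acc in pareto_frontier(models.values()):
        print(f"cost=${cost:.3f}  accuracy={acc:.2f}")
```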
Presenter: ~Bradley_Fowler1
Format: Maybe: the presenting author will attend in person, contingent on other factors that still need to be determined (e.g., visa, funding).
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Funding: No, the presenting author of this submission does *not* fall under ICLR’s funding aims, or has sufficient alternate funding.
Submission Number: 19