Keywords: Benchmarking, Routing, Routers, LLMs, Judging, Judges, Pareto Frontier, Probabilistic Graphical Models
TL;DR: This paper introduces methods for attaining, and quantifies, the quality and cost improvements achievable on widely used benchmarks by leveraging the diversity of responses both within and across LLMs.
Abstract: Existing benchmarks typically report accuracy for a single model on a single run. This systematically understates real-world LLM capabilities, particularly under heterogeneous data distributions: (i) different models answer different questions correctly, reflecting their specializations, and (ii) given a budget, multiple generations can be sampled and selectively retained. To quantify this gap, we introduce the Capability Frontier: a Pareto frontier over a set of models that characterizes the best achievable performance at each cost level under optimal selection across models and generations (i.e., via an oracle).
Our construction corrects for two opposing biases: underestimation from single-model evaluation and overestimation from taking maxima over noisy samples. We study 21 LLMs across 16 widely used benchmarks spanning coding, reasoning, medicine, factuality, instruction following, and agentic tasks, comparing Capability Frontier performance at matched cost against each benchmark’s top-performing model. Correcting for single-model evaluation yields a 54% error-rate reduction; additionally correcting for single runs yields an 82% reduction, matching SOTA accuracy at an 85% cost reduction. Complementing these empirical results, we use controlled probabilistic simulations to show that higher query-topic entropy produces a near-monotonic increase in the performance gap between oracle routing and the best single model. Our findings suggest that collective LLM capabilities are substantially underestimated, with implications for evaluation and deployment in data-heterogeneous, multi-domain settings.
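Illustration (ours, not from the submission): a minimal Python sketch of how oracle cost/accuracy points and their Pareto frontier could be computed from per-generation correctness data. The array shapes, cost values, and budget enumeration below are illustrative assumptions, not the paper's actual construction.

```python
# Minimal sketch (illustrative, not the authors' code): compute oracle
# (cost, accuracy) points over model subsets and generation counts, then
# extract the Pareto frontier. All data here is synthetic.
import itertools
import numpy as np

rng = np.random.default_rng(0)
n_models, n_questions, n_gens = 4, 200, 3

# correct[m, q, g] = True if generation g of model m answers question q correctly;
# each model gets a random per-question success rate (hypothetical data).
correct = rng.random((n_models, n_questions, n_gens)) < rng.random((n_models, 1, 1))
cost_per_gen = np.array([1.0, 2.0, 4.0, 8.0])  # hypothetical per-generation cost

points = []
# The oracle counts a question as solved if ANY retained generation from ANY
# selected model is correct; cost is the total spend on retained generations.
for k in range(1, n_gens + 1):
    for r in range(1, n_models + 1):
        for subset in itertools.combinations(range(n_models), r):
            idx = list(subset)
            acc = correct[idx, :, :k].any(axis=(0, 2)).mean()
            cost = cost_per_gen[idx].sum() * k
            points.append((cost, acc))

# Pareto frontier: sorted by cost (ties broken by higher accuracy), keep only
# points that strictly improve accuracy over every cheaper point.
points.sort(key=lambda p: (p[0], -p[1]))
frontier, best_acc = [], -1.0
for cost, acc in points:
    if acc > best_acc:
        frontier.append((cost, acc))
        best_acc = acc
print(frontier)
```

Each frontier point is the best oracle accuracy achievable at or below that cost; comparing it to the single best model's point at matched cost is, on our reading, the kind of gap the abstract quantifies.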
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 37