Stop Reporting System-Level AI Reasoning as Individual Model Capability

Published: 23 May 2026, Last Modified: 23 May 2026ICML 2026 AIWILDEveryoneRevisionsBibTeXCC BY 4.0
Keywords: agentic AI evaluation, system-level reasoning, multi-component systems, test-time compute, leaderboard attribution, distributed cognition, reporting standards
TL;DR: Frontier reasoning scores conflate model capability with system composition; we audit 26 leaderboard entries and propose a compute-normalized reporting protocol.
Abstract: GPT-5.5 scores 41.4% on Humanity's Last Exam. GPT-5.5 Pro, the same model with parallel test-time compute and tool access, scores 57.2%. Both appear on leaderboards as "GPT-5.5." That 15.8-point gap is not a model improvement; it is a system-composition difference wearing a model-capability label. This paper argues that frontier reasoning scores are produced by assemblies of models, test-time compute, tools, and verifiers, yet the field attributes them to a single model name. We develop a four-level hierarchy of inference complexity, present a pilot audit of 26 entries across 11 benchmarks revealing pervasive undisclosed system-level composition, ground the argument in distributed cognition theory, and propose a compute-normalized reporting protocol (SCRP) as a minimum standard for scientific credibility.
Track: Regular Paper (9 pages)
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 273
Loading