To Aggregate or Not to Aggregate? Test-Time Aggregation Beyond Verifier-Friendly Benchmarks

Shreyas Singh; Guduru Manoj; Pradeep Moturi; Kunal Singh

Published: 30 May 2026, Last Modified: 30 May 2026

Venue: ICML2026-AI4Science Poster

License: CC BY 4.0

Track: Track 1: Original Research/Position/Education/Attention Track

Keywords: test-time scaling, aggregation, reasoning, tool use, open-ended math, medical reasoning, information-seeking, large language models

TL;DR: Aggregation-based test-time scaling works well in open-ended math and tool-augmented information-seeking, but generalizes unevenly across domains and largely fails in medical reasoning despite substantial rollout diversity.

Abstract: Aggregation-based test-time scaling has produced strong gains on competition mathematics and code generation, but it remains unclear whether those gains transfer beyond verifier-friendly benchmarks, under which task conditions aggregation helps, and when **single-step aggregation (SSA)** is sufficient relative to **recursive self-aggregation (RSA)**. We study these questions across structured reasoning, knowledge-intensive reasoning, and medical reasoning, spanning proof-style mathematics, expert-level STEM reasoning, social-science knowledge tasks, BrowseComp-style information seeking, and both tool-free and tool-integrated regimes. Aggregation is effective when sampled trajectories contain recoverably complementary information: complementary reasoning progress in structured reasoning, or complementary retrieved evidence in tool-integrated knowledge/evidence-seeking. Structured reasoning and tool-integrated knowledge/evidence-seeking recover 48% and 57% of available headroom on average, whereas medical reasoning without tools recovers only 21% despite comparable multi-sample headroom between Pass@1 and Pass@8. Tool use improves medical base performance, but aggregation remains weak because trajectories more often reflect competing clinical interpretations than composable intermediate progress. Within favorable regimes, aggregation type also matters: RSA is most useful for open-ended proof generation, whereas SSA captures most of the gain in tool-integrated knowledge/evidence-seeking at lower cost. Aggregation value therefore depends jointly on task structure, tool access, and aggregation type rather than following a uniform test-time scaling law across domains.
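
The abstract does not spell out how "recovered headroom" is computed or how SSA and RSA differ procedurally. The sketch below is one plausible reading, not the authors' implementation. It assumes headroom is the gap between Pass@1 and Pass@8 and that recovery is the fraction of that gap closed by the aggregated answer. It also assumes SSA aggregates all rollouts once, while RSA repeatedly re-aggregates random subsets of a candidate population. The names `headroom_recovered`, `single_step_aggregate`, `recursive_self_aggregate`, and the `aggregate` callable are hypothetical; in practice `aggregate` would be an LLM call.

```python
import random
from typing import Callable, List, Sequence

# Hypothetical aggregator type: takes candidate answers, returns one answer.
Aggregator = Callable[[Sequence[str]], str]


def headroom_recovered(pass_at_1: float, pass_at_k: float, aggregated_acc: float) -> float:
    """Fraction of the Pass@1 -> Pass@k gap closed by aggregation (assumed definition)."""
    headroom = pass_at_k - pass_at_1
    if headroom <= 0:
        raise ValueError("no multi-sample headroom: pass_at_k must exceed pass_at_1")
    return (aggregated_acc - pass_at_1) / headroom


def single_step_aggregate(rollouts: Sequence[str], aggregate: Aggregator) -> str:
    """SSA (assumed form): one aggregation pass over all sampled rollouts."""
    return aggregate(rollouts)


def recursive_self_aggregate(
    rollouts: Sequence[str],
    aggregate: Aggregator,
    subset_size: int,
    steps: int,
    rng: random.Random,
) -> List[str]:
    """RSA (assumed form): refine a population by re-aggregating random subsets."""
    if not 1 <= subset_size <= len(rollouts):
        raise ValueError("subset_size must be between 1 and the number of rollouts")
    population = list(rollouts)
    for _ in range(steps):
        # Each new candidate is built from a random subset of the current population.
        population = [
            aggregate(rng.sample(population, subset_size))
            for _ in range(len(population))
        ]
    return population


if __name__ == "__main__":
    # Toy stand-in for an LLM aggregator: keep the longest candidate.
    longest = lambda candidates: max(candidates, key=len)
    rollouts = ["a", "bbb", "cc"]
    print(single_step_aggregate(rollouts, longest))
    print(recursive_self_aggregate(rollouts, longest, subset_size=2, steps=3, rng=random.Random(0)))
    # Illustrative numbers only, not results from the paper.
    print(headroom_recovered(pass_at_1=0.40, pass_at_k=0.60, aggregated_acc=0.50))
```

Under this reading, the cost difference noted in the abstract follows from the call counts: SSA issues one aggregation call per problem, whereas RSA issues one call per population member at every step.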

Submission Number: 109
