Track: Track 1: Original Research/Position/Education/Attention Track
Keywords: test-time scaling, aggregation, reasoning, tool use, open-ended math, medical reasoning, information-seeking, large language models
TL;DR: Aggregation-based test-time scaling works well in open-ended math and tool-augmented information-seeking, but generalizes unevenly across domains and largely fails in medical reasoning despite substantial rollout diversity.
Abstract: Aggregation-based test-time scaling has produced strong gains on competition
mathematics and code generation, but it remains unclear whether those gains
transfer beyond verifier-friendly benchmarks, under which task conditions
aggregation helps, and when **single-step aggregation (SSA)** is sufficient relative to
**recursive self-aggregation (RSA)**. We study these questions across structured reasoning,
knowledge-intensive reasoning, and medical reasoning, spanning proof-style
mathematics, expert-level STEM reasoning, social-science knowledge tasks,
BrowseComp-style information seeking, and both tool-free and tool-integrated
regimes. Aggregation is effective when sampled
trajectories contain recoverably complementary information: complementary
reasoning progress in structured reasoning, or complementary retrieved
evidence in tool-integrated knowledge/evidence-seeking.
Structured reasoning and tool-integrated knowledge/evidence-seeking recover
48% and 57% of available headroom on average, respectively, whereas medical
reasoning without tools recovers only 21% despite comparable multi-sample
headroom between Pass@1 and Pass@8.
Tool use improves medical base performance, but aggregation remains weak
because trajectories more often reflect competing clinical interpretations
than composable intermediate progress. Within favorable regimes, aggregation
type also matters: RSA is most useful for open-ended proof generation,
whereas SSA captures
most of the gain in tool-integrated knowledge/evidence-seeking at lower cost.
Aggregation value therefore depends jointly on task structure, tool access,
and aggregation type rather than following a uniform test-time scaling law
across domains.
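
The headroom-recovery figures quoted above can be read as the fraction of the gap between Pass@1 and Pass@8 that aggregation closes. The sketch below is one plausible formalization, not the authors' confirmed definition: the function name `headroom_recovered`, its parameters, and the example numbers are all hypothetical and chosen only for illustration.

```python
def headroom_recovered(pass_at_1: float, pass_at_k: float, aggregated: float) -> float:
    """Fraction of the Pass@1 -> Pass@k gap closed by aggregation.

    Assumed definition (not taken from the paper):
        (aggregated - pass_at_1) / (pass_at_k - pass_at_1)
    """
    headroom = pass_at_k - pass_at_1
    if headroom <= 0:
        # Without a positive gap there is nothing for aggregation to recover.
        raise ValueError("no headroom: pass_at_k must exceed pass_at_1")
    return (aggregated - pass_at_1) / headroom


# Illustrative numbers only; these are NOT results from the paper.
example = headroom_recovered(pass_at_1=0.40, pass_at_k=0.60, aggregated=0.50)
print(f"recovered {example:.0%} of available headroom")
```

Under this reading, a value near 0 means aggregation performs no better than a single sample, and a value near 1 means it matches the best-of-8 oracle.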
Submission Number: 109