GICA: The Gap-Index Compositional Arm Framework for Sample-Efficient Test-Time Scaling

TMLR Paper9725 Authors

13 Jun 2026 (modified: 20 Jun 2026), Under review for TMLR, CC BY 4.0
Abstract: Test-time scaling (TTS) improves the reasoning capabilities of large language models (LLMs) by generating multiple candidate reasoning paths and using a verifier to select among them. Process reward models (PRMs), which score each intermediate step rather than only the final answer, yield stronger downstream accuracy but at a higher cost. Recent PRMs that themselves scale at test time by generating long verification chains of thought (CoTs) verify more accurately, but their cost grows with both the number of paths and the length of each path (its number of steps), which limits scalability precisely where TTS is most beneficial. We recast reasoning-based process-level verification as a sample-efficient adaptive selection problem. We propose GICA (Gap-Index Compositional Arm framework), a bandit-based framework that exploits the compositional structure of reasoning paths to share information across related steps and identify the top-$K$ candidates. We establish theoretical correctness and a fixed-confidence sample-complexity bound, and we validate GICA in synthetic experiments and in an end-to-end TTS pipeline across three mathematical reasoning benchmarks. We experiment with two open-weight math LLMs as generators and two LLMs as process-level, reasoning-based verifiers. GICA matches the accuracy of exhaustive process-level verification while substantially reducing verifier calls (by $4.2\times$) and inference runtime (by $4.3\times$), making fine-grained step-level supervision practical at scale. We open-source our code and data to facilitate future research: https://anonymous.4open.science/r/GICA-1B57.
Submission Type: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Lijun_Zhang1
Submission Number: 9725