Test-Time Scaling via Metric Geometry for LLM Reasoning

17 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: test-time scaling, large language model, reasoning
TL;DR: We interpret LLM reasoning as maze solving to achieve efficient test-time scaling.
Abstract: Test-Time Scaling (TTS) methods improve the reasoning capability of large language models (LLMs) by generating multiple independent Chains-of-Thought (CoTs) and aggregating them via designed policies. Despite being effective, this ensemble approach incurs expensive inference costs due to repeated model calls. In this paper, we propose a physics-inspired framework that achieves the accuracy gains of multi-call TTS within a single LLM call or a few calls. It conceptualizes LLM reasoning as navigating a maze: a complex puzzle through which one must find a path to a specific goal. The proposed $Maze$ paradigm embeds candidate exemplars and domain knowledge into a multiplex latent manifold and learns a high-dimensional metric space. At inference time, the $Maze$ metric identifies one or a few optimal paths; each path is an ordered sequence of exemplars, forming a few-shot prompt that guides the LLM to the correct answer. Empirically, on reasoning benchmarks including GPQA, MMLU-Pro, GSM8K, MATH-500, and AIME, $Maze$ matches or exceeds the accuracy of Best-of-$N$ strategies while reducing computational cost by 60$\sim$80\%. These results establish $Maze$ as a principled geometric alternative to brute-force TTS, enabling low-latency, interpretable, and computation-efficient reasoning for complex tasks. We also advocate an interesting width-depth equivalence in LLM reasoning under the $Maze$ framework: any solution achievable by many shallow trials can also be attained by a suitably planned sequence of reasoning steps.
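The abstract does not specify how a path through the metric space is turned into a prompt. As a purely illustrative sketch of the general idea, the toy function below greedily chains exemplars by distance in an embedding space, starting from the query embedding, and concatenates them into a few-shot prompt. Everything here is an assumption: `select_exemplar_path` is an invented name, and plain Euclidean distance stands in for the paper's learned multiplex metric.

```python
import numpy as np

def select_exemplar_path(query_vec, exemplar_vecs, exemplars, k=3):
    """Illustrative only: greedily chain k exemplars into a short path
    through an embedding space, starting from the query embedding.
    Euclidean distance is a stand-in for the learned Maze metric."""
    remaining = list(range(len(exemplars)))
    current = query_vec
    path = []
    for _ in range(k):
        # Pick the remaining exemplar nearest to the current position.
        dists = [np.linalg.norm(exemplar_vecs[i] - current) for i in remaining]
        nearest = remaining[int(np.argmin(dists))]
        path.append(nearest)
        current = exemplar_vecs[nearest]
        remaining.remove(nearest)
    # The ordered exemplar sequence becomes a few-shot prompt.
    return "\n\n".join(exemplars[i] for i in path)
```

Under this reading, "width" (many independent CoT samples) is traded for "depth" (one carefully ordered exemplar sequence), which is the width-depth equivalence the abstract advocates.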
Supplementary Material: zip
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 9081