Keywords: test-time scaling, large language model, reasoning
TL;DR: We interpret LLM reasoning as maze solving to achieve efficient test-time scaling.
Abstract: Test-Time Scaling (TTS) methods improve the reasoning capability of large language models (LLMs) by generating multiple independent Chain-of-Thoughts (CoTs) and aggregating them via designed policies.
Despite its effectiveness, this ensemble approach incurs substantial inference cost due to repeated model calls.
In this paper, we propose a physics-inspired framework that achieves the accuracy gains of multi-call TTS within a single LLM call, or a few calls.
It conceptualizes LLM reasoning as navigating a maze: a complex puzzle through which one must find a path to a specific goal.
The proposed $Maze$ paradigm embeds candidate exemplars and domain knowledge into a multiplex latent manifold and learns a high-dimensional metric space.
At inference time, the $Maze$ metric identifies a single optimal path, or a few;
each path is an ordered sequence of exemplars that forms a few-shot prompt guiding the LLM to the correct answer.
Empirically, on reasoning benchmarks including GPQA, MMLU-Pro, GSM8K, MATH-500, and AIME, $Maze$ matches or exceeds the accuracy of Best-of-$N$ strategies while reducing computational cost by 60--80\%.
These results position $Maze$ as a principled geometric alternative to brute-force TTS, enabling low-latency, interpretable, and computation-efficient reasoning for complex tasks.
We also observe an intriguing width-depth equivalence in LLM reasoning under the $Maze$ framework: any solution achievable by many shallow, independent trials can also be attained by a single, suitably planned sequence of reasoning steps.
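To make the inference step concrete, the following is a minimal sketch of path selection over exemplar embeddings. It assumes exemplars have already been embedded and a metric matrix `M` has been learned (here replaced by the identity as a placeholder); the greedy nearest-neighbor chaining is an illustrative stand-in, not the paper's actual path-search procedure.

```python
import numpy as np

def learned_distance(x, y, M):
    """Mahalanobis-style distance under a learned metric matrix M (placeholder)."""
    d = x - y
    return float(np.sqrt(d @ M @ d))

def select_exemplar_path(query_emb, exemplar_embs, M, k=3):
    """Greedily chain the k exemplars closest under the metric,
    starting from the query embedding. Returns an ordered index list;
    the indices would then be used to assemble a few-shot prompt."""
    path, current = [], query_emb
    remaining = list(range(len(exemplar_embs)))
    for _ in range(min(k, len(exemplar_embs))):
        best = min(remaining, key=lambda i: learned_distance(current, exemplar_embs[i], M))
        path.append(best)
        remaining.remove(best)
        current = exemplar_embs[best]
    return path

# Toy usage: four exemplars in 2-D, identity metric.
E = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]])
q = np.array([0.1, 0.1])
print(select_exemplar_path(q, E, np.eye(2), k=2))  # -> [0, 1]
```

The ordered indices returned by `select_exemplar_path` correspond to the "path" in the abstract: the exemplars at those indices, in that order, would be concatenated into a few-shot prompt before a single LLM call.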
Supplementary Material: zip
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 9081