R$^{2}$-LLMs: Enhancing Test-Time Scaling of Large Language Models with Hierarchical Retrieval-Augmented MCTS
Abstract: Test-time scaling has emerged as a promising paradigm in language modeling, leveraging additional computational resources at inference time to enhance model performance. In this work, we introduce \textbf{R$^{2}$-LLMs}, a novel and versatile hierarchical retrieval-augmented reasoning framework designed to improve test-time scaling in large language models (LLMs) without requiring distillation from more advanced models to obtain chain-of-thought (CoT) training data. \textbf{R$^{2}$-LLMs} enhances inference-time generalization by integrating dual-level retrieval-based in-context learning: (1) at the \textbf{coarse level}, our approach extracts abstract templates from complex reasoning problems and retrieves similar problem-answer pairs to facilitate high-level in-context learning; (2) at the \textbf{fine level}, during Monte Carlo Tree Search (MCTS), \textbf{R$^{2}$-LLMs} efficiently retrieves analogous intermediate solution steps from reference mathematical problem datasets, refining step-wise reasoning with the aid of a process reward model (PRM) for scoring. \textbf{R$^{2}$-LLMs} is a robust hierarchical reasoning-augmentation method that enhances in-context reasoning while integrating seamlessly with step-level tree search methods. Using the PRM, it refines both candidate generation and decision-making for improved reasoning accuracy. Empirical evaluations on the \textbf{MATH500, GSM8K, and OlympiadBench-TO} datasets show substantial relative improvements of up to \textbf{24\%} over the baselines, demonstrating the effectiveness of our approach on complex mathematical reasoning tasks.
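To make the dual-level retrieval idea in the abstract concrete, the following is a minimal Python sketch, not the authors' implementation: it uses a toy token-overlap retriever, a stub PRM, and a greedy step selection in place of full MCTS; all names (`coarse_retrieve`, `fine_retrieve`, `prm_score`) are hypothetical.

```python
# Hedged sketch of dual-level retrieval-augmented reasoning (assumptions only):
# coarse level retrieves similar problem-answer pairs for in-context learning;
# fine level retrieves analogous intermediate steps while a stub PRM scores
# candidate steps. Greedy selection stands in for MCTS rollouts.

from dataclasses import dataclass


@dataclass
class Reference:
    problem: str
    steps: list  # worked solution steps


def similarity(a: str, b: str) -> float:
    """Toy Jaccard similarity over whitespace tokens (assumption)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(len(ta | tb), 1)


def coarse_retrieve(query: str, refs: list, k: int = 2) -> list:
    """Coarse level: fetch similar problem-answer pairs as ICL exemplars."""
    return sorted(refs, key=lambda r: similarity(query, r.problem), reverse=True)[:k]


def fine_retrieve(partial_step: str, refs: list, k: int = 2) -> list:
    """Fine level: fetch analogous intermediate steps to refine reasoning."""
    steps = [s for r in refs for s in r.steps]
    return sorted(steps, key=lambda s: similarity(partial_step, s), reverse=True)[:k]


def prm_score(step: str) -> float:
    """Stub process reward model: favors steps containing an equation (assumption)."""
    return 1.0 if "=" in step else 0.1


def solve(query: str, candidate_steps: list, refs: list) -> list:
    """Greedy PRM-guided step selection (simplified stand-in for MCTS)."""
    exemplars = coarse_retrieve(query, refs)        # high-level ICL context
    trace = [f"Exemplar: {e.problem}" for e in exemplars]
    for candidates in candidate_steps:              # one expansion per depth
        hints = fine_retrieve(candidates[0], refs)  # analogous reference steps
        best = max(candidates, key=prm_score)       # PRM picks the next step
        trace.append(f"{best}  (hints: {hints})")
    return trace


if __name__ == "__main__":
    refs = [Reference("Solve 2x + 3 = 7",
                      ["subtract 3: 2x = 4", "divide by 2: x = 2"])]
    cands = [["subtract 5: 3x = 6", "guess x"],
             ["divide by 3: x = 2", "stop"]]
    for line in solve("Solve 3x + 5 = 11", cands, refs):
        print(line)
```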
Paper Type: Long
Research Area: Generation
Research Area Keywords: Test-Time Scaling, LLMs, Reasoning, RAG
Contribution Types: Model analysis & interpretability, NLP engineering experiment
Languages Studied: English
Submission Number: 5955