Chain-in-Tree: Back to Sequential Reasoning in LLM Tree Search
Keywords: LLM, LLM Inference, Test-time Compute, Tree Search
Abstract: Test-time scaling enables large language models (LLMs) to improve performance on long-horizon reasoning tasks by allocating additional compute at inference. Tree-search-based approaches achieve state-of-the-art results in this setting, but they are notoriously inefficient, often at least an order of magnitude slower than simpler iterative methods. We introduce Chain-in-Tree (CiT), a plug-in framework that adaptively decides when to branch during search rather than branching at every step. CiT relies on lightweight Branching Necessity (BN) evaluation methods: BN-DP (Direct Prompting), where an auxiliary LLM directly judges whether a step requires branching, and BN-SC (Self-Consistency), which clusters multiple candidate actions to estimate agreement. We integrate CiT into three representative LLM-in-the-loop tree search (LITS) frameworks (Tree of Thoughts (ToT-BS), ReST-MCTS, and RAP) and evaluate across GSM8K and Math500. Our results show that:
1) BN-DP consistently reduces token generation, model invocations, and runtime by 75–85% across all settings, with negligible accuracy loss and, in some cases, accuracy gains;
2) BN-SC typically yields substantial savings (up to 80%) but shows instability in 1–4 of the 14 settings, caused by a small subset of examples that produce extremely long reasoning steps;
3) the quality of auxiliary LLMs is critical: not only the BN evaluator in BN-DP, but also the auxiliary models used in BN-SC for clustering (aggregator) and for pairwise equivalence checks. When these roles are filled by smaller LLMs (e.g., LLaMA-3-8B), performance degrades substantially. Importantly, BN-SC does not require LLMs at all in domains with deterministic action spaces (e.g., board games), where clustering can be performed with programmatic rules.
Finally, we provide a theoretical guarantee that BN-DP never increases runtime relative to the baseline, and we release a unified implementation of CiT across ToT-BS, ReST-MCTS, and RAP with modular LLM-profiled roles to facilitate reproducibility and extension.
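To make the two BN evaluation methods concrete, here is a minimal, hypothetical Python sketch of how they could operate, assuming a generic `llm(prompt) -> str` callable and a pairwise `equivalent(a, b) -> bool` checker (an auxiliary LLM judge or, in deterministic action spaces, a programmatic rule). The function names, prompt wording, and agreement threshold are illustrative assumptions, not the authors' released implementation.

```python
from typing import Callable, List

def bn_dp(llm: Callable[[str], str], problem: str, partial_chain: str) -> bool:
    """BN-DP (Direct Prompting): an auxiliary LLM directly judges whether
    the next reasoning step requires branching. Returns True to branch."""
    prompt = (
        f"Problem:\n{problem}\n\nReasoning so far:\n{partial_chain}\n\n"
        "Would exploring multiple alternative next steps help here? "
        "Answer 'yes' or 'no'."
    )
    return llm(prompt).strip().lower().startswith("yes")

def bn_sc(
    llm: Callable[[str], str],
    equivalent: Callable[[str, str], bool],
    problem: str,
    partial_chain: str,
    n_samples: int = 5,       # illustrative sample budget
    threshold: float = 0.6,   # illustrative agreement cutoff
) -> bool:
    """BN-SC (Self-Consistency): sample several candidate next steps, cluster
    them by pairwise equivalence, and branch only when no cluster dominates
    (i.e., the samples disagree on what the next step should be)."""
    prompt = f"Problem:\n{problem}\n\nReasoning so far:\n{partial_chain}\n\nNext step:"
    candidates = [llm(prompt) for _ in range(n_samples)]
    clusters: List[List[str]] = []
    for cand in candidates:
        for cluster in clusters:
            if equivalent(cluster[0], cand):  # compare to cluster representative
                cluster.append(cand)
                break
        else:  # no matching cluster found; start a new one
            clusters.append([cand])
    agreement = max(len(c) for c in clusters) / n_samples
    return agreement < threshold  # low agreement => branching is necessary
```

In a LITS loop, such a check would run before each node expansion: when branching is judged unnecessary, the search falls back to a single chained step, which is the source of the token, invocation, and runtime savings the abstract reports.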
Primary Area: foundation or frontier models, including LLMs
Submission Number: 23541