Keywords: Efficiency, pruning, sampling, Kullback–Leibler divergence
TL;DR: KL-divergence-guided sampling reduces memory usage and token generation without sacrificing accuracy.
Abstract: Large language models (LLMs) improve reasoning accuracy when generating multiple candidate solutions at test time, but standard methods like Best-of-N (BoN) incur high computational cost by fully generating all branches. Self-Truncation Best-of-N (ST-BoN) mitigates this by truncating unpromising paths early, but its reliance on consistency-based heuristics does not directly evaluate branch quality, which can limit efficiency on heterogeneous tasks. We present the KL-Adjusted Pruned Path Algorithm (KAPPA), an inference-time method that combines Kullback–Leibler divergence, confidence, and entropy into a principled scoring function to guide progressive pruning. By promoting diversity during exploration and selectively eliminating low-scoring branches, KAPPA maintains accuracy while substantially reducing memory and token usage. Experiments on GSM8K and MATH500 with DeepSeek-R1-Distill-Qwen-1.5B and Qwen2.5-7B-Instruct demonstrate that KAPPA stabilizes performance in smaller models and achieves up to ~60\% reduction in peak memory and ~90\% reduction in total token generation relative to BoN, with minimal impact on accuracy.
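For intuition only, the sketch below illustrates how a score combining KL divergence, confidence, and entropy might drive progressive pruning of candidate branches. The function names, weights, reference distribution, and keep ratio are illustrative assumptions for this toy example, not KAPPA's actual formulation from the paper.

```python
import numpy as np

def branch_score(p_branch, p_ref, w_kl=1.0, w_conf=1.0, w_ent=1.0):
    """Score one candidate branch from its next-token distribution.

    p_branch: next-token probability distribution of this branch.
    p_ref:    reference distribution (here, the mean over live branches),
              against which divergence is measured.
    The weights and the exact combination are assumptions, not the
    paper's scoring function.
    """
    eps = 1e-12
    kl = np.sum(p_branch * np.log((p_branch + eps) / (p_ref + eps)))  # KL(p_branch || p_ref)
    confidence = np.max(p_branch)                                     # top-token probability
    entropy = -np.sum(p_branch * np.log(p_branch + eps))              # branch uncertainty
    # Reward divergence from the reference (diversity) and confidence; penalize entropy.
    return w_kl * kl + w_conf * confidence - w_ent * entropy

def prune_branches(distributions, keep_ratio=0.5):
    """Keep the top-scoring fraction of branches; truncate the rest early."""
    p_ref = np.mean(distributions, axis=0)
    scores = [branch_score(p, p_ref) for p in distributions]
    k = max(1, int(len(distributions) * keep_ratio))
    keep = np.argsort(scores)[::-1][:k]  # indices of highest-scoring branches
    return sorted(keep.tolist())

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy next-token distributions for 4 candidate branches over a 10-token vocabulary.
    dists = rng.dirichlet(np.ones(10), size=4)
    print("branches kept:", prune_branches(dists, keep_ratio=0.5))
```

In an actual decoding loop, such a pruning step would be applied repeatedly as branches are extended, so that low-scoring paths stop consuming memory and tokens well before completion.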
Submission Number: 172