Routing-Aware Inference for Improving Reasoning Consistency in Large Language Models

ACL ARR 2026 January Submission2377 Authors

02 Jan 2026 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · CC BY 4.0
Keywords: reasoning, inference-time methods, large language models, trajectory selection
Abstract: Inference-time variability remains a major source of error in large language models, particularly on multi-step reasoning tasks. While sampling-based methods such as self-consistency reduce this variability through aggregation, they rely on answer-level voting and do not explicitly regulate the selection of internal reasoning trajectories. This paper introduces routing-aware inference, a deterministic inference-time mechanism that selects among multiple generated reasoning trajectories based on representational agreement. The approach is motivated by a variance-reduction perspective: under stochastic decoding, correct reasoning trajectories tend to exhibit stable representational alignment, whereas erroneous trajectories diverge due to compounding early errors. By routing inference toward internally consistent trajectories, the method reduces variance-induced failures without increasing total token budgets or modifying model parameters. Extensive zero-shot experiments across six benchmarks spanning extractive, multi-hop, arithmetic, and multi-domain reasoning demonstrate consistent improvements over single-pass prompting, chain-of-thought prompting, and self-consistency under matched inference budgets. Ablation studies further show that substantial gains arise from structured trajectory selection rather than increased sampling alone. The proposed framework operates entirely at inference time, requires no training or external tools, and is compatible with both proprietary and open-weight models.
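The abstract describes routing inference toward the most internally consistent of several sampled reasoning trajectories. As an illustrative sketch only (the paper's actual scoring mechanism is not specified here), one simple instantiation is to embed each trajectory and route to the one with the highest mean cosine similarity to the others; the embeddings below are hypothetical placeholders.

```python
# Hedged sketch, NOT the authors' implementation: route to the trajectory
# whose representation agrees most with the other sampled trajectories.
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def route_by_agreement(trajectory_embeddings):
    """Return the index of the trajectory with the highest mean pairwise
    cosine similarity to all other trajectories (a deterministic selection,
    no extra generation budget)."""
    n = len(trajectory_embeddings)
    best_idx, best_score = 0, float("-inf")
    for i in range(n):
        score = sum(
            cosine(trajectory_embeddings[i], trajectory_embeddings[j])
            for j in range(n) if j != i
        ) / max(n - 1, 1)
        if score > best_score:
            best_idx, best_score = i, score
    return best_idx

# Toy example: two aligned trajectories and one divergent outlier.
embs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
print(route_by_agreement(embs))  # -> 1 (most aligned with the others)
```

Unlike answer-level majority voting (as in self-consistency), this selection operates on trajectory representations, so it can prefer a consistent trajectory even when no answer string is repeated.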
Paper Type: Long
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: reasoning, inference methods, robustness, large language models, question answering
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Approaches to low-compute settings (efficiency)
Languages Studied: English
Submission Number: 2377