Keywords: chain-of-thought, latent variable models, hidden Markov model, sequential representations, uncertainty quantification, posterior inference, calibration
TL;DR: We model LLM reasoning chains as an HMM over latent step correctness, using exact forward-backward inference to propagate uncertainty and derive a principled reflection policy that outperforms heuristic baselines.
Abstract: Chain-of-thought prompting elicits multi-step reasoning from large language
models, yet existing approaches treat confidence at each step as an
independent signal. This independence assumption contradicts the autoregressive generation process: errors at early steps propagate forward and corrupt downstream outputs, creating epistemic blind spots where a model appears locally certain but is globally unreliable. This motivates sequence-level probabilistic inference over latent reasoning correctness. We introduce \emph{Probabilistic
Chain-of-Thought} (PCoT), which models a reasoning chain as a Hidden Markov
Model over latent step correctness and performs exact posterior inference via
the forward-backward algorithm. PCoT yields a principled answer confidence
$C_{\mathrm{final}}$ and a posterior-driven reflection policy that, under the model, dominates raw-score threshold rules. On MATH and GSM8K, PCoT reduces
Expected Calibration Error by $\mathbf{76\%}$ over the best heuristic
baseline and improves accuracy by $\mathbf{14.7}$ percentage points at a
$2\times$ token budget, while remaining robust across three confidence
estimators. Our analysis of \emph{sequential contamination}---whereby a
single upstream error suppresses posteriors of all downstream steps---
provides a formal explanation for why point-wise step scoring is
insufficient for reliable reasoning evaluation.
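The abstract's core mechanism, exact forward-backward inference over binary latent step correctness, admits a compact sketch. The code below is our illustration, not the paper's implementation: the emission model (reading each per-step confidence as $P(\text{obs} \mid \text{correct})$), the transition parameters `p_stay` and `p_recover`, the threshold `tau` in the reflection rule, and the reading of $C_{\mathrm{final}}$ as the final step's posterior marginal are all assumptions.

```python
import numpy as np

def forward_backward(confidences, p_stay=0.95, p_recover=0.05):
    """Exact posterior inference over latent step correctness.

    A minimal sketch under assumed parameters, not the authors' code:
      - latent state z_t in {0: incorrect, 1: correct};
      - confidences[t] in (0, 1) is treated as the emission likelihood
        P(obs_t | z_t = correct), with its complement used for z_t = incorrect;
      - a correct step stays correct w.p. p_stay, and an incorrect step
        recovers only w.p. p_recover, so errors are sticky, which is one
        way to model the sequential contamination the abstract describes.
    """
    conf = np.asarray(confidences, dtype=float)
    T = len(conf)
    # Transition matrix A[i, j] = P(z_{t+1} = j | z_t = i); rows: incorrect, correct.
    A = np.array([[1.0 - p_recover, p_recover],
                  [1.0 - p_stay,    p_stay]])
    # Emission likelihoods e[t, j] = P(obs_t | z_t = j).
    e = np.stack([1.0 - conf, conf], axis=1)

    # Forward pass, normalized at each step to avoid underflow.
    alpha = np.zeros((T, 2))
    alpha[0] = np.array([0.5, 0.5]) * e[0]
    alpha[0] /= alpha[0].sum()
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * e[t]
        alpha[t] /= alpha[t].sum()

    # Backward pass.
    beta = np.ones((T, 2))
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (e[t + 1] * beta[t + 1])
        beta[t] /= beta[t].sum()

    # Smoothed posterior marginals gamma[t, j] = P(z_t = j | all observations).
    gamma = alpha * beta
    gamma /= gamma.sum(axis=1, keepdims=True)
    return gamma

def reflection_policy(gamma, tau=0.5):
    """Flag steps whose posterior correctness falls below tau
    (a hypothetical threshold) as candidates for reflection."""
    return [t for t, p in enumerate(gamma[:, 1]) if p < tau]

# Example: one low raw confidence at step 1 suppresses the posteriors of
# the later steps even though their raw confidences are high.
gamma = forward_backward([0.9, 0.3, 0.8, 0.85])
c_final = gamma[-1, 1]  # one plausible reading of C_final (an assumption)
print(gamma[:, 1], c_final, reflection_policy(gamma))
```

In this toy run the sticky incorrect state is what produces contamination: once the posterior puts mass on an upstream error, downstream steps inherit it through the transition matrix, which a point-wise score on each step would miss.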
Submission Number: 9