Quantifying Consistency in LLM Logical Reasoning via Structural Uncertainty

Published: 01 Apr 2026, Last Modified: 25 Apr 2026 · ICLR 2026 Workshop LLM Reasoning · Best Paper · CC BY 4.0
Track: long paper (up to 10 pages)
Keywords: logical reasoning, self-preference consistency, structural uncertainty, uncertainty decomposition, LLM evaluation
TL;DR: We measure how stably an LLM ranks its own reasoning candidates via self-preference; this structural consistency signal complements answer dispersion and identifies unreliable reasoning that dispersion alone misses.
Abstract: Large language models can arrive at the same answer through reasoning paths that are unstable, contradictory, or difficult to rank consistently---a failure mode especially prevalent in multi-step deductive reasoning. Existing methods assess reasoning reliability primarily through output dispersion---measuring how much sampled answers differ---but this view discards a complementary signal: whether the model can consistently rank competing reasoning candidates. We propose structural uncertainty, a consistency-aware evaluation framework derived from the stability of self-preference-induced rankings over sampled reasoning solutions. Given a query, we generate multiple candidate solutions and ask the same model to judge pairwise preferences among its own outputs. We aggregate sparse self-preferences into ranking distributions via Bradley--Terry modeling with PageRank, and decompose the signal into two complementary entropy-based components---across-trial ranking instability and within-trial candidate ambiguity. Across five LLMs and eight benchmarks, structural signals provide information complementary to answer dispersion: on logical and mathematical reasoning tasks, the combination improves identification of unreliable reasoning instances, while on factual retrieval the structural signal collapses toward uniformity, diagnosing a regime boundary where reasoning-level consistency evaluation is uninformative. The two components relate differently to accuracy: within-trial ambiguity correlates positively with correctness on reasoning tasks---consistent with settings where multiple plausible solution paths remain competitive---while across-trial instability correlates negatively, signaling unreliable reasoning. Structural uncertainty is best understood not as a universal confidence estimator, but as a regime-sensitive evaluator of logical reasoning consistency.
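The aggregation step described in the abstract can be sketched in a few lines. The sketch below is an illustrative approximation, not the paper's exact formulation: the damping factor, the preference-graph construction, and the use of top-1 frequency as an across-trial instability proxy are all our assumptions.

```python
import numpy as np

def pagerank_scores(wins, d=0.85, iters=100):
    """Aggregate pairwise preferences into a ranking distribution.

    wins[i, j] = 1 if candidate i was preferred over candidate j in a
    self-preference judgment. Mass flows from each losing candidate to
    the candidates that beat it, so frequently preferred candidates
    accumulate score (a PageRank-style stand-in for the Bradley--Terry
    + PageRank aggregation named in the abstract)."""
    n = wins.shape[0]
    col_sums = wins.sum(axis=0)
    # Column-stochastic transition matrix; candidates that lost to no one
    # (zero column) distribute their mass uniformly.
    M = np.where(col_sums > 0, wins / np.maximum(col_sums, 1e-12), 1.0 / n)
    r = np.ones(n) / n
    for _ in range(iters):
        r = (1 - d) / n + d * M @ r
    return r / r.sum()

def entropy(p):
    """Shannon entropy of a (possibly unnormalized) distribution."""
    p = np.asarray(p, dtype=float)
    p = p / p.sum()
    return float(-(p * np.log(p + 1e-12)).sum())

def across_trial_instability(rankings):
    """Entropy of which candidate ranks first across repeated trials --
    a simplified proxy for across-trial ranking instability."""
    tops = [int(np.argmax(r)) for r in rankings]
    counts = np.bincount(tops, minlength=len(rankings[0]))
    return entropy(counts)

# Toy example: candidate 0 beats 1 and 2, candidate 1 beats 2.
wins = np.zeros((3, 3))
wins[0, 1] = wins[0, 2] = wins[1, 2] = 1.0
r = pagerank_scores(wins)
# Within-trial ambiguity: entropy of one trial's ranking distribution.
ambiguity = entropy(r)
```

With a clear transitive preference order, the ranking distribution concentrates on candidate 0 and its entropy falls below the uniform value `log(3)`; identical rankings across trials drive the instability proxy toward zero, matching the "structural signal collapses toward uniformity" diagnostic only when preferences are genuinely uninformative.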
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Submission Number: 172