Abstract: Large language models (LLMs) that iteratively revise their outputs, whether via chain-of-thought, self-reflection, or multi-agent debate, lack principled guarantees on the stability of their probability updates. We identify a consistent multiplicative scaling law governing how instruction-tuned LLMs revise probability assignments over candidate answers: log q₁(i) = α[log q₀(i) + log b(i)] + c, where α is a belief revision exponent and b is evidence from verification. We prove that α < 1 is necessary and sufficient for asymptotic stability under iterated revision. Empirical validation across 4,975 problems, four benchmarks (GPQA Diamond, TheoremQA, MMLU-Pro, ARC-Challenge), and two primary model families (GPT-5.2, Claude Sonnet 4) yields α = 1.163 ± 0.084 with mean R² = 0.76: models exhibit near-Bayesian update behavior, slightly above the stability boundary. Although single-step α exceeds 1, multi-step validation on 198 GPQA problems over 7 revision steps shows α decaying from 0.84 to 0.54, yielding contractive long-run dynamics consistent with the stability theorem. Token-level logprob validation on 191 problems with Llama-3.3-70B confirms median α ≈ 1.0 for both logprob and self-reported elicitation. Decomposing the update into prior and evidence components reveals architecture-specific trust-ratio fingerprints: GPT-5.2 exhibits balanced weighting (τ ≈ 1.0), while Claude shows slight evidence-favoring (τ ≈ 1.1). This work characterizes observable inference-time update behavior; it does not claim that LLMs internally perform Bayesian inference. The α-law provides a principled diagnostic for monitoring update quality in LLM inference systems.
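A minimal sketch may help fix intuition for the α-law. Under the update rule log q₁(i) = α[log q₀(i) + log b(i)] + c, with c absorbed by renormalization, iterated revision contracts to a fixed point q* ∝ b^(α/(1−α)) when α < 1 and collapses toward a degenerate distribution when α > 1; because the rule is linear in log space, α can be recovered by least squares. Everything concrete below is an illustrative assumption, not the paper's code: the `revise` helper, the 4-way candidate set, and the Dirichlet-sampled evidence vector b.

```python
import numpy as np

def revise(q, b, alpha):
    """One alpha-law step: q_next(i) ∝ (q(i) * b(i))**alpha.

    The additive constant c in log q1(i) = alpha*[log q0(i) + log b(i)] + c
    is absorbed by renormalization.
    """
    logq = alpha * (np.log(q) + np.log(b))
    logq -= logq.max()                 # guard against overflow/underflow
    q_next = np.exp(logq)
    return q_next / q_next.sum()

rng = np.random.default_rng(0)
b = rng.dirichlet(np.ones(4))          # hypothetical verification evidence
q0 = np.full(4, 0.25)                  # uniform prior over 4 candidates

# Iterated revision: contractive for alpha < 1, expansive for alpha > 1.
for alpha in (0.7, 1.2):
    q = q0.copy()
    for _ in range(15):
        q = revise(q, b, alpha)
    print(f"alpha={alpha}: {np.round(q, 4)}")
# alpha = 0.7 converges to the fixed point q* ∝ b**(alpha / (1 - alpha));
# alpha = 1.2 pushes mass onto the argmax of b (degenerate in the limit).

# Recovering alpha: regress log q1 on (log q0 + log b) across candidates.
x = np.log(q0) + np.log(b)
y = np.log(revise(q0, b, 0.9))         # synthetic single-step observation
A = np.vstack([x, np.ones_like(x)]).T
alpha_hat, c_hat = np.linalg.lstsq(A, y, rcond=None)[0]
print(f"recovered alpha = {alpha_hat:.3f}")  # -> 0.900
```

On synthetic data the regression recovers α exactly; with real elicited probabilities the fit quality (the paper's R² ≈ 0.76) measures how well the single-exponent law describes the model's updates.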
Submission Type: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Zhihui_Zhu1
Submission Number: 7716