Beyond Exponential Decay: Rethinking Error Accumulation in Large Language Models

Mikhail L Arbuzov, Sisong Bei, Ziwei Dong, Dmitri Kalaev, Alexey Shvets

Published: 04 May 2026, Last Modified: 07 May 2026, OpenReview Archive Direct Upload, Everyone, CC BY 4.0

Abstract: A common pessimistic argument holds that autoregressive language models suffer exponential decay in correctness over long outputs: if each token has independent error probability $e$, then $(1-e)^n \to 0$ as $n$ grows. The argument is clean to state and widely cited. It is also brittle, and three lines of recent empirical work make the cracks visible. The first is that only a small subset of tokens---roughly $5\%$ to $10\%$ in the studies that have actually measured it---genuinely depends on long-range context; the rest get more predictable, not less, as context accumulates. The second is geometric: LLM embeddings organize into stratified low-dimensional manifolds, so once a model is working inside one semantic region it tends to stay there even when individual tokens slip. The third concerns what happens when models do err on the consequential tokens---errors turn out to be idiosyncratic across samples rather than systematic, which is why majority-vote ensembles recover so much accuracy. Pulling these together gives a two-rate model, $P(\text{correct}) \approx (1-e_{\text{key}})^k \cdot (1-e_{\text{non}})^{n-k}$, in which $k$ scales sublinearly with $n$ and $e_{\text{non}}$ approaches zero with sufficient context. The predicted decay is, at worst, stretched-exponential; often power-law; and when $k$ saturates at some task-specific $k_{\max}$, constant in $n$. A number of recent capabilities---anchor compression at $99\%$ context reduction, $128$K-token retrieval on consumer GPUs, self-consistency gains on reasoning benchmarks---then read as natural consequences of one structural fact rather than independent engineering wins: long-context reliability hinges on a handful of decision points, not on uniform per-token accuracy.
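
To make the three decay regimes concrete, the following minimal Python sketch compares the uniform model $(1-e)^n$ with the two-rate model under three growth laws for $k$. Everything numeric here is an illustrative assumption rather than a value from the paper: the error rates, the idealized limit $e_{\text{non}} = 0$, the saturation level `k_max = 20`, and the specific choices $k = \sqrt{n}$ and $k = \ln n$ as examples of sublinear growth. The function names are hypothetical as well.

```python
import math


def uniform_correct(n: int, e: float) -> float:
    """Classic argument: each of n tokens fails independently with probability e."""
    return (1.0 - e) ** n


def two_rate_correct(n: int, k: float, e_key: float, e_non: float) -> float:
    """Two-rate model: k key tokens at rate e_key, n - k other tokens at rate e_non."""
    return (1.0 - e_key) ** k * (1.0 - e_non) ** (n - k)


# Hypothetical growth laws for the number of key tokens k(n).
def k_sqrt(n: int) -> float:
    return math.sqrt(n)  # sublinear: gives stretched-exponential decay


def k_log(n: int) -> float:
    return math.log(n)  # logarithmic: gives power-law decay


def k_saturating(n: int, k_max: int = 20) -> float:
    return min(n, k_max)  # saturates: gives a constant in n


E = 0.01      # assumed uniform per-token error rate
E_KEY = 0.01  # assumed error rate on key tokens
E_NON = 0.0   # idealized limit: non-key tokens never fail

for n in (100, 1_000, 10_000, 100_000):
    row = [
        uniform_correct(n, E),
        two_rate_correct(n, k_sqrt(n), E_KEY, E_NON),
        two_rate_correct(n, k_log(n), E_KEY, E_NON),
        two_rate_correct(n, k_saturating(n), E_KEY, E_NON),
    ]
    print(n, " ".join(f"{p:.3e}" for p in row))
```

With $k = \ln n$ and $e_{\text{non}} = 0$, the two-rate expression reduces to $n^{\ln(1-e_{\text{key}})}$, which is the power-law case; with $k$ capped at `k_max`, the value stops depending on $n$ at all. A nonzero `E_NON` reintroduces an exponential factor in $n - k$, which is why the model's optimistic regimes rely on $e_{\text{non}}$ approaching zero with sufficient context.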