Correction and Corruption: A Two-Rate View of Error Flow in LLM Protocols

26 Apr 2026 (modified: 10 May 2026) · Under review for TMLR · CC BY 4.0
Abstract: Large language models are increasingly deployed as \emph{protocols}: structured multi-call procedures that spend additional computation to transform a baseline answer into a final one. These protocols are usually evaluated only by end-to-end accuracy, which reveals whether they deliver gains on average but gives limited insight into when they help, when they hurt, and whether their behavior transfers under distribution shift or composition. We propose a \emph{paired-outcome measurement interface} for auditing a single protocol step on exact-match tasks. For each instance, the interface records a baseline correctness bit $E_0\in\{0,1\}$ and a post-step correctness bit $E_1\in\{0,1\}$, with accuracies $p_t:=\Pr(E_t=1)$. This separates \emph{correction}, $E_0{=}0\to E_1{=}1$, from \emph{corruption}, $E_0{=}1\to E_1{=}0$, through two conditional rates: the correction rate $c=\Pr(E_1{=}1\mid E_0{=}0)$ and the corruption rate $\gamma=\Pr(E_1{=}0\mid E_0{=}1)$. These two rates are sufficient to predict accuracy changes and determine whether a step helps at a given baseline. They also define a reusable empirical interface whose transfer can be tested across seeds, difficulty mixtures, and composed pipelines. We identify three mechanisms by which this interface can fail to transfer. Under \textbf{mixture shift}, estimates of $(c,\gamma)$ pooled across difficulty regimes become biased when calibration and deployment mixtures differ; conditioning on depth identifies a regime variable under which the interface becomes stable and enables predictive transfer, substantially reducing this bias without additional model calls. Under \textbf{presentation contamination}, selection protocols can change the measured interface through stable presentation artifacts even when candidate content is fixed. 
Finally, under \textbf{state insufficiency}, the correctness bit alone may not carry enough history for multi-step pipelines to compose predictably; a testable Markov factorization characterizes when composition is valid and identifies where additional state is needed when it is not. When a protocol step passes these diagnostics, it becomes an auditable module: it can be gated by estimated gain, conditioned on difficulty proxies to correct mixture bias, and composed into multi-step pipelines with predictable accuracy. We demonstrate these ideas on synthetic mathematical tasks with controlled difficulty and on GSM8K using observable complexity proxies, where the calibrated interface correctly predicts when protocol steps should be activated or suppressed.
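The two-rate bookkeeping described in the abstract can be illustrated with a short sketch. This is a hypothetical illustration under our own naming (the functions `estimate_rates`, `predicted_delta`, and `compose` are ours, not from the paper's code): it estimates the correction rate $c=\Pr(E_1{=}1\mid E_0{=}0)$ and the corruption rate $\gamma=\Pr(E_1{=}0\mid E_0{=}1)$ from paired correctness bits, predicts the accuracy change of one protocol step at baseline $p_0$, and chains steps under the Markov factorization the abstract says must be tested, not assumed.

```python
# Hypothetical sketch of the paired-outcome interface; names are illustrative.

def estimate_rates(e0, e1):
    """Estimate (c, gamma) from paired 0/1 correctness bits.

    c     = Pr(E1=1 | E0=0): fraction of initially wrong answers the step fixes.
    gamma = Pr(E1=0 | E0=1): fraction of initially right answers the step breaks.
    """
    wrong = [b for a, b in zip(e0, e1) if a == 0]
    right = [b for a, b in zip(e0, e1) if a == 1]
    c = sum(wrong) / len(wrong)
    gamma = sum(1 - b for b in right) / len(right)
    return c, gamma

def predicted_delta(p0, c, gamma):
    """Expected accuracy change: the step gains c on the wrong mass (1 - p0)
    and loses gamma on the right mass p0. Positive iff p0 < c / (c + gamma)."""
    return (1 - p0) * c - p0 * gamma

def compose(p0, steps):
    """Chain steps (c_i, gamma_i) under the Markov assumption that each step's
    rates depend only on the current correctness bit, not on earlier history."""
    p = p0
    for c, gamma in steps:
        p = p * (1 - gamma) + (1 - p) * c
    return p
```

For example, a step with $c = \gamma = 0.5$ is exactly neutral at $p_0 = 0.5$ and helpful below it, matching the abstract's point that the same step can help or hurt depending on the baseline; `compose` is only trustworthy once the Markov factorization diagnostic passes.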
Submission Type: Long submission (more than 12 pages of main content)
Previous TMLR Submission Url: https://openreview.net/forum?id=IwqAuqs2Q2
Changes Since Last Submission: The previous submission was desk rejected for formatting. This resubmission corrects the TMLR formatting/template issues. The main manuscript PDF has been rebuilt in compliant TMLR style. The supplementary package has also been cleaned and reorganized for submission. The scientific content is unchanged: the paper proposes a paired-outcome measurement interface for auditing LLM protocol steps via two rates (correction and corruption), demonstrates three failure modes of transfer (mixture shift, presentation contamination, and state insufficiency), and provides diagnostic tools and deployment decision rules evaluated on synthetic depth-stratified tasks and GSM8K.
Assigned Action Editor: ~Ankit_Singh_Rawat1
Submission Number: 8626