Abstract: Large language models are increasingly deployed as \emph{protocols}: structured
multi-call procedures that spend additional computation to transform a baseline
answer into a final one. These protocols are usually evaluated only by
end-to-end accuracy, which reveals whether they deliver gains on average but gives
limited insight into when they help, when they hurt, and whether their
behavior transfers under distribution shift or composition.
We propose a \emph{paired-outcome measurement interface} for auditing a single
protocol step on exact-match tasks. For each instance, the interface records a
baseline correctness bit $E_0\in\{0,1\}$ and a post-step correctness bit
$E_1\in\{0,1\}$, with accuracies $p_t:=\Pr(E_t=1)$. This separates
\emph{correction}, $E_0{=}0\to E_1{=}1$, from \emph{corruption},
$E_0{=}1\to E_1{=}0$, through two conditional rates: the correction rate
$c=\Pr(E_1{=}1\mid E_0{=}0)$ and the corruption rate
$\gamma=\Pr(E_1{=}0\mid E_0{=}1)$. These two rates are sufficient to predict
accuracy changes and determine whether a step helps at a given baseline. They
also define a reusable empirical interface whose transfer can be tested across
seeds, difficulty mixtures, and composed pipelines.
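As a minimal illustration (all numbers hypothetical, and `simulate` is a synthetic stand-in for running the baseline and the protocol step on the same items), the two rates determine post-step accuracy through the identity $p_1 = p_0(1-\gamma) + (1-p_0)\,c$, which follows directly from the definitions. The sketch below calibrates $(c,\gamma)$ on one sample and predicts accuracy at a shifted baseline:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(n, p0, c, gamma):
    """Synthetic paired correctness bits (E0, E1) for one protocol step."""
    E0 = rng.random(n) < p0
    E1 = np.where(E0, rng.random(n) >= gamma, rng.random(n) < c)
    return E0, E1

# Calibrate the interface on one sample ...
E0, E1 = simulate(20_000, p0=0.6, c=0.5, gamma=0.1)
c_hat = E1[~E0].mean()      # correction rate  Pr(E1=1 | E0=0)
g_hat = (~E1[E0]).mean()    # corruption rate  Pr(E1=0 | E0=1)

# ... then predict post-step accuracy at a *different* baseline accuracy,
# via p1 = p0 * (1 - gamma) + (1 - p0) * c.
F0, F1 = simulate(20_000, p0=0.4, c=0.5, gamma=0.1)
p0_new = F0.mean()
p1_pred = p0_new * (1 - g_hat) + (1 - p0_new) * c_hat
print(abs(p1_pred - F1.mean()))  # small when (c, gamma) transfer across samples
```

The same identity gives the deployment rule implied by the abstract: the step helps exactly when $(1-p_0)\,c > p_0\,\gamma$, i.e., when expected corrections outweigh expected corruptions at the current baseline.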
We identify three mechanisms by which this interface can fail to transfer.
Under \textbf{mixture shift}, estimates of $(c,\gamma)$ pooled across
difficulty regimes become biased when the calibration and deployment mixtures
differ; conditioning on problem depth yields a regime variable under which the
interface is stable, enabling predictive transfer and substantially reducing
this bias without additional model calls.
Under \textbf{presentation contamination}, selection
protocols can change the measured interface through stable presentation artifacts
even when candidate content is fixed. Finally, under \textbf{state
insufficiency}, the correctness bit alone may not carry enough history for
multi-step pipelines to compose predictably; a testable Markov factorization
characterizes when composition is valid and identifies where additional state
is needed when it is not.
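The Markov factorization above can be checked directly from paired bits: within each stratum of $E_1$, the behavior of the second step should be independent of the earlier history $E_0$. A hedged sketch on synthetic data (the true rates below are hypothetical, and the simulated second step depends on $E_1$ only, so the diagnostic should pass):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

# Two-step pipeline in which step 2 depends only on E1 (Markov holds by design).
E0 = rng.random(n) < 0.5
E1 = np.where(E0, rng.random(n) >= 0.1, rng.random(n) < 0.4)
E2 = np.where(E1, rng.random(n) >= 0.15, rng.random(n) < 0.3)

# Markov diagnostic: within each E1 stratum, step-2 success should not
# depend on E0; a large gap signals state insufficiency.
for e1 in (False, True):
    gap = abs(E2[(E1 == e1) & ~E0].mean() - E2[(E1 == e1) & E0].mean())
    print(e1, gap)  # small gaps indicate the factorization holds
```

When the gaps are large, the correctness bit alone is an insufficient state summary, and the pipeline's composed accuracy cannot be predicted from per-step $(c,\gamma)$ alone.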
When a protocol step passes these diagnostics, it becomes an auditable module:
it can be gated by estimated gain, conditioned on difficulty proxies to correct
mixture bias, and composed into multi-step pipelines with predictable accuracy.
We demonstrate these ideas on synthetic mathematical tasks with controlled
difficulty and on GSM8K using observable complexity proxies,
where the calibrated interface correctly predicts when protocol steps should be
activated or suppressed.
Submission Type: Long submission (more than 12 pages of main content)
Previous TMLR Submission Url: https://openreview.net/forum?id=IwqAuqs2Q2
Changes Since Last Submission: The previous submission was desk rejected for formatting. This resubmission corrects the TMLR formatting/template issues. The main manuscript PDF has been rebuilt in compliant TMLR style. The supplementary package has also been cleaned and reorganized for submission. The scientific content is unchanged: the paper proposes a paired-outcome measurement interface for auditing LLM protocol steps via two rates (correction and corruption), demonstrates three failure modes of transfer (mixture shift, presentation contamination, and state insufficiency), and provides diagnostic tools and deployment decision rules evaluated on synthetic depth-stratified tasks and GSM8K.
Assigned Action Editor: ~Ankit_Singh_Rawat1
Submission Number: 8626