Prompt-Level Drift as an Operational Monitoring Problem: Schema Failure Cliffs and Judge-Version Risk in Artifact-Grounded Evaluation
Keywords: prompt drift, operational monitoring, LLM evaluation, schema violations, LLMs-as-judges, judge-version risk, reproducibility
TL;DR: We frame prompt-level drift as an operational monitoring problem and show that small prompt changes can trigger schema failure cliffs and substantial judge-version sensitivity in artifact-grounded evaluation.
Abstract: Prompt templates are routinely edited in deployed LLM systems and evaluation
pipelines, often treated as “non-functional” refactors. We reframe prompt-level
drift as an operational monitoring problem: minimal prompt perturbations can
trigger discontinuous schema violations, degrade instruction following, and even
shift conclusions when the judge prompt—the measurement instrument—changes.
Methodologically, we study these failures through an artifact-grounded evaluation
pipeline built on a fixed evidence snapshot, where preserved outputs, versioned
judge/model slices, processed records, and summary tables form a one-way audit
chain. Across six preserved judge/model slices under a baseline judge (three
judge models scoring outputs from the other two providers), reducing contract
salience from explicit to implicit increases schema hard-breaks (A structure= 0)
from 8/48 (16.7%) to 34/48 (70.8%) and drops the mean total score from 7.69
to 2.75 (max total = 10). We further show judge-version sensitivity on identical
preserved outputs: for one overlapping slice (ChatGPT-as-judge scoring Gemini
outputs under implicit triggers), the hard-break rate shifts from 8/8 (v0) to 0/8
(v1/v2), and the mean total score shifts from 0.0 (v0) to 8.0 (v1) to 10.0 (v2).
These patterns motivate CAO-style change control: pin prompt and judge versions,
monitor interface validity, and prohibit pooling metrics across judge versions.
Submission Number: 13
Loading