A Prompt-Masked Pilot for History-Dependent Safety Degradation in Multi-Turn Conversational Agents
Keywords: agent evaluation, multi-turn conversational safety, synthetic pilot, unsupported-belief reinforcement, prompt-level safety scaffolding
TL;DR: A synthetic 12-turn conversational-agent pilot measures history-dependent unsupported-belief reinforcement under prompt-masked automated grading and reports replay and leakage diagnostics.
Abstract: Closed-loop conversational systems can accumulate unsupported premises across turns, producing a history-dependent failure boundary that single-turn evaluations can under-characterize. We present a prompt-masked 12-turn pilot protocol over nine matched synthetic stress-test personas, each paired across three policies: periodic grounding, one-time safety persona framing, and no intervention. A GPT-5.4/GPT-5.2 LLM-as-judge ensemble scores the resulting 27 matched conversations at the conversation level on a coded $-1$ to $4$ primary-confirmation rubric after intervention prompts, system messages, and reasoning blocks are removed. In this cohort, periodic grounding receives a lower mean primary confirmation score than matched control by $0.55$ points (95\% paired-bootstrap CI $[0.32, 0.73]$, exact two-sided paired sign-flip $p = 0.0078$). We did not detect a difference between periodic grounding and one-time persona framing (mean difference $-0.03$ points, 95\% paired-bootstrap CI $[-0.20, 0.18]$, exact two-sided paired sign-flip $p = 0.8359$); an exploratory control-versus-persona contrast points in the same direction as the primary comparison. A residual style-leakage probe predicts the three arms from masked assistant text above chance (55.6\% vs. 33.3\%, leave-one-persona-out, permutation $p = 0.008$), so the evaluation is prompt-masked rather than fully blinded. The study remains limited by synthetic personas, automated judges, inference over only nine matched personas, and unresolved external validity; it reports lower automated unsupported-belief endorsement scores in a synthetic stress-test cohort, not clinical efficacy or deployment readiness.
Track: Regular Paper (9 pages)
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 250
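The exact two-sided paired sign-flip test named in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the function name `exact_sign_flip_p` and the per-persona differences below are hypothetical, and the sketch assumes the test statistic is the mean (equivalently, the sum) of the nine per-persona score differences.

```python
from itertools import product


def exact_sign_flip_p(diffs, tol=1e-12):
    """Exact two-sided paired sign-flip p-value.

    Enumerates all 2^n sign assignments of the paired differences and
    counts how often the absolute summed statistic is at least as large
    as the observed one (the observed assignment is included).
    """
    n = len(diffs)
    observed = abs(sum(diffs))
    hits = 0
    for signs in product((1, -1), repeat=n):
        stat = abs(sum(s * d for s, d in zip(signs, diffs)))
        if stat >= observed - tol:  # tolerance guards against float ties
            hits += 1
    return hits / 2 ** n


# Hypothetical control-minus-grounding differences for nine personas
# (illustrative values only, not data from the study).
example_diffs = [0.4, 0.7, 0.5, 0.6, 0.3, 0.8, 0.5, 0.6, 0.55]
p_value = exact_sign_flip_p(example_diffs)
print(p_value)  # 0.00390625, i.e. 2/512
```

With nine matched pairs there are $2^9 = 512$ sign assignments, so every attainable p-value is a multiple of $1/512$ and the smallest two-sided value is $2/512 \approx 0.0039$. The reported values are consistent with this grid: $0.0078 \approx 4/512$ and $0.8359 \approx 428/512$.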