RAVR-S: State-Sensitive Verification and Repair for Trustworthy Rule-Governed LLM Dialogue

Yaroslav Pelekhov

Published: 03 Jun 2026, Last Modified: 03 Jun 2026. Venue: AI4GOOD Workshop 2026 (Regular). License: CC BY 4.0

Keywords: Trustworthy AI, LLM verification and repair, state-sensitive dialogue, therapeutic dialogue, AI for mental health

Abstract: Large language models produce fluent therapeutic responses that frequently violate explicit domain constraints, especially across multi-turn dialogues in which the patient's state evolves. Existing refinement strategies (self-critique, diversity sampling, free-form self-refine) raise surface quality but give no transparency about which constraints they satisfy or violate, and they ignore dialogue-level state dynamics. We introduce RAVR-S, a state-sensitive verification-and-repair framework for rule-governed LLM dialogue. At each turn, a structured verifier scores the candidate response against a typed 58-predicate inventory and emits a proof object that lists satisfied and violated constraints. RAVR-S extends the base verify-and-repair loop (RAVR) with a state-transition estimator: a separate LLM call tracks a discrete patient-state vector $(\mathrm{trust}, \mathrm{distress}, \mathrm{fatigue}) \in \{0,1,2,3\}^3$, scores $K=3$ response candidates against the predicted state dynamics, and picks the best candidate before optional targeted repair. We evaluate RAVR-S in a two-stage human study. The Stage 1 screening (10 annotators, 1,440 judgments) places Self-Refine first at 73.5% mean win rate and RAVR second at 60.7%, with Regenerate$\times$3 (34.3%) and Vanilla (31.3%) trailing. Stage 2 narrows the comparison: 20 annotators with an advanced or doctoral degree in psychology (credential-verified) compare RAVR-S against the two strongest baselines on state-sensitive 3-turn therapeutic mini-dialogues. RAVR-S wins 94.7% of comparisons (excluding ties), at almost-perfect agreement (Gwet's AC1 $= 0.82$). Automated evaluation on TherapyBench, a new public benchmark of 288 multi-turn trajectories across four therapy modalities, shows that RAVR-S holds 90% trajectory adherence and a 100% policy-compliance rate across 8-turn sessions while keeping iatrogenic pressure at 4.2%. 
Self-Refine moves the other way under state pressure: its policy-switch rate falls to 16.7%, below the 50% Vanilla baseline, and its pressure rate is roughly $9\times$ that of RAVR-S (0.375 vs. 0.042). Single-turn repair on 1,500 turns confirms consistent adherence gains (+16.7 points, $p < 0.001$), convergence in a single repair iteration, and interpretable predicate-level diagnostics. TherapyBench is released at https://anonymous.4open.science/r/TherapyBench-D547.
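
The verify-then-select step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the two predicates, the rule-based `predict_next` stand-in for the LLM state-transition estimator, the `utility` function, and the scoring rule (predicted-state utility minus the number of violated predicates) are all hypothetical assumptions chosen for illustration. Only the state space $\{0,1,2,3\}^3$, the $K=3$ candidates, and the proof object with satisfied and violated constraints come from the abstract.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# (trust, distress, fatigue), each component in {0, 1, 2, 3} as in the abstract.
State = Tuple[int, int, int]
Predicate = Callable[[str, State], bool]


@dataclass(frozen=True)
class Proof:
    """Proof object: which named constraints a response satisfies or violates."""
    satisfied: List[str]
    violated: List[str]


def verify(response: str, state: State, predicates: Dict[str, Predicate]) -> Proof:
    """Check a response against every predicate and record the outcome by name."""
    satisfied, violated = [], []
    for name, pred in predicates.items():
        (satisfied if pred(response, state) else violated).append(name)
    return Proof(satisfied, violated)


def clamp(v: int) -> int:
    """Keep a state component inside {0, 1, 2, 3}."""
    return max(0, min(3, v))


# Hypothetical predicates (the paper's inventory has 58; these two are invented).
PREDICATES: Dict[str, Predicate] = {
    "no_directive_under_distress":
        lambda r, s: not (s[1] >= 2 and "you must" in r.lower()),
    "acknowledges_feeling":
        lambda r, s: "sounds" in r.lower(),
}


def predict_next(state: State, response: str) -> State:
    """Toy rule-based stand-in for the LLM state-transition estimator (assumption)."""
    trust, distress, fatigue = state
    text = response.lower()
    if "you must" in text:
        trust, distress = trust - 1, distress + 1
    elif "sounds" in text:
        trust, distress = trust + 1, distress - 1
    return (clamp(trust), clamp(distress), clamp(fatigue))


def utility(state: State) -> int:
    """Assumed preference: higher trust, lower distress and fatigue."""
    trust, distress, fatigue = state
    return trust - distress - fatigue


def select_candidate(candidates: List[str], state: State) -> Tuple[str, Proof]:
    """Score each candidate and return the best one together with its proof."""
    scored = []
    for cand in candidates:
        proof = verify(cand, state, PREDICATES)
        score = utility(predict_next(state, cand)) - len(proof.violated)
        scored.append((score, cand, proof))
    _, best, best_proof = max(scored, key=lambda t: t[0])
    return best, best_proof


# Usage: K = 3 candidates for a moderately distressed patient state.
state: State = (1, 2, 1)
candidates = ["You must try harder.", "That sounds really hard.", "Okay."]
best, proof = select_candidate(candidates, state)
print(best, proof)
```

In this sketch, targeted repair would only be triggered when `proof.violated` is non-empty for the selected candidate, which mirrors the "optional targeted repair" wording in the abstract.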

Email Sharing: We authorize the sharing of all author emails with Program Chairs.

Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.

Submission Number: 289
