Abstract

Objective: To expose the reasoning pathways of a reinforcement learning policy for Medicaid care coordination, to develop an error taxonomy, and to implement fairness-aware guardrails.

Design: Retrospective interpretability audit using attention analysis, Shapley explanations, sparse autoencoder feature discovery and blinded clinician adjudication.

Setting: Medicaid care coordination programmes in Washington, Virginia and Ohio (July 2023 to June 2025).

Participants: 250,000 intervention decisions; 200 divergent cases reviewed by five clinicians.

Main outcome measures: Calibrated harm prediction; algorithmic clearance and residual harm rates; error taxonomy frequencies; subgroup fairness metrics.

Results: The conformal model achieved an area under the receiver operating characteristic curve of 0.80 (95% CI 0.78 to 0.82), clearing 89.5% of decisions with a 1.22% residual harm rate. Seven clinically coherent reasoning motifs were identified linking social determinants to clinical cascades. Premise errors (information gaps) were the most frequent (42%), followed by reasoning errors (31%) and safety violations (27%). Post-hoc fairness adjustments reduced demographic parity gaps by 34%.

Conclusions: Mechanistic interpretability transforms opaque algorithmic assistance into auditable decision support, providing a governance scaffold for clinical artificial intelligence deployment.
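The demographic parity gap reported in the Results can be illustrated with a minimal sketch: the gap is the difference in positive-decision rates across demographic groups, and a post-hoc adjustment (here, illustrative group-specific thresholds) narrows it. All data, group labels and threshold values below are hypothetical, not the study's.

```python
import numpy as np

def demographic_parity_gap(decisions, groups):
    """Absolute gap between the highest and lowest positive-decision
    rates across demographic groups.

    decisions: binary array (1 = intervention offered)
    groups:    array of group labels, same length as decisions
    """
    rates = [decisions[groups == g].mean() for g in np.unique(groups)]
    return max(rates) - min(rates)

# Hypothetical illustration: risk scores for two groups with shifted means
rng = np.random.default_rng(0)
groups = np.array(["A"] * 500 + ["B"] * 500)
scores = np.concatenate([rng.normal(0.55, 0.1, 500),
                         rng.normal(0.45, 0.1, 500)])

# Single global threshold versus illustrative per-group thresholds
global_dec = (scores >= 0.5).astype(int)
adj_thresholds = {"A": 0.55, "B": 0.45}  # hypothetical post-hoc adjustment
adj_dec = np.array([int(s >= adj_thresholds[g])
                    for s, g in zip(scores, groups)])

gap_before = demographic_parity_gap(global_dec, groups)
gap_after = demographic_parity_gap(adj_dec, groups)
```

Because the per-group thresholds sit at each group's score mean, both groups receive interventions at roughly the same rate, so `gap_after` falls well below `gap_before`; in practice such thresholds would be chosen on a validation set rather than assumed.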