Keywords: post-deployment monitoring, statistical hypothesis testing, covariate shift, concept drift, clinical AI, reliability
TL;DR: This paper argues that post-deployment monitoring in clinical AI is underdeveloped and proposes statistically valid and label-efficient testing frameworks to ensure reliability and safety.
Abstract: This position paper argues that post-deployment monitoring in clinical AI is underdeveloped and proposes statistically valid and label-efficient testing frameworks as a principled foundation for ensuring reliability and safety in real-world deployment. A recent review found that only 9\% of FDA-registered AI-based healthcare tools include a post-deployment surveillance plan. Existing monitoring approaches are often manual, sporadic, and reactive, making them ill-suited for the dynamic environments in which clinical models operate. We contend that post-deployment monitoring should instead be grounded in label-efficient and statistically valid testing frameworks, offering a principled alternative to current practices. We use the term "statistically valid" to refer to methods that provide explicit guarantees on error rates (e.g., Type I/II error), enable formal inference under pre-defined assumptions, and support reproducibility, features that align with regulatory requirements. Specifically, we propose that detecting shifts in the input data and detecting degradation in model performance should be framed as two distinct statistical hypothesis testing problems. Grounding monitoring in statistical rigor provides a reproducible and scientifically sound basis for maintaining the reliability of clinical AI systems. Importantly, it also opens new research directions for the technical community, spanning theory, methods, and tools for statistically principled detection, attribution, and mitigation of post-deployment model failures in real-world settings.
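To make the proposed framing concrete, the minimal sketch below illustrates how the two monitoring questions can each be posed as a hypothesis test with an explicit Type I error level. This is an illustrative example rather than the method developed in the paper: the data, variable names (e.g., `reference_scores`, `current_errors`), the choice of tests (a two-sample Kolmogorov-Smirnov test for shift, a one-sided two-proportion z-test for degradation), and the alpha level are all assumptions made for the sketch.

```python
# Illustrative sketch (not the paper's method): two monitoring questions,
# each framed as a hypothesis test with a pre-specified Type I error level alpha.
# Assumed setup: `reference_scores` are model outputs on a reference cohort,
# `current_scores` are outputs on a recent post-deployment batch, and the
# `*_errors` arrays are 0/1 prediction errors on labeled audit samples.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Stand-in data for real deployment logs (purely synthetic).
reference_scores = rng.beta(2, 5, size=2000)          # reference score distribution
current_scores = rng.beta(2.5, 5, size=400)           # recent batch, possibly shifted
reference_errors = rng.binomial(1, 0.12, size=2000)   # errors on labeled reference data
current_errors = rng.binomial(1, 0.18, size=150)      # errors on a small labeled audit sample

alpha = 0.05  # target Type I error rate for each test

# Test 1 (data shift, label-free): H0 says reference and current scores
# come from the same distribution; a two-sample KS test is one possible choice.
ks_stat, ks_p = stats.ks_2samp(reference_scores, current_scores)
print(f"Shift test: KS={ks_stat:.3f}, p={ks_p:.4f}, "
      f"{'reject H0 (distribution changed)' if ks_p < alpha else 'no evidence of shift'}")

# Test 2 (performance degradation, label-efficient): H0 says the current error
# rate is no higher than the reference rate; one-sided two-proportion z-test.
n_ref, n_cur = len(reference_errors), len(current_errors)
p_ref, p_cur = reference_errors.mean(), current_errors.mean()
p_pool = (reference_errors.sum() + current_errors.sum()) / (n_ref + n_cur)
se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_ref + 1 / n_cur))
z = (p_cur - p_ref) / se
p_value = 1 - stats.norm.cdf(z)  # one-sided: is the current error rate higher?
print(f"Degradation test: z={z:.2f}, p={p_value:.4f}, "
      f"{'reject H0 (performance degraded)' if p_value < alpha else 'no evidence of degradation'}")
```

In a continuous monitoring setting, running such tests repeatedly over time would require sequential or multiple-testing corrections to preserve the stated error guarantees; that is one of the methodological gaps the position paper points to.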
Lay Summary: Artificial intelligence is increasingly used in hospitals to help doctors diagnose diseases and make treatment decisions. But once these AI tools are deployed, their performance can quietly decline over time — for example, as patient populations change, new equipment is introduced, or clinical practices evolve. Unfortunately, most medical AI systems are rarely checked after deployment: a recent review found that only 9\% of FDA-registered AI-based healthcare tools include a post-deployment surveillance plan.
Our work argues that monitoring these systems should be treated as a rigorous statistical problem. We propose using statistical hypothesis testing, the same kind of rigorous analysis used in medical trials, to continuously check whether an AI model's accuracy is degrading or whether the data it sees has shifted. By grounding post-deployment monitoring in statistical principles, we provide a clear path toward safer, more trustworthy AI in healthcare, and a foundation for new research on how to detect, explain, and fix model failures in real-world settings.
Submission Number: 602