
\input{fig/fig_infographic}

\looseness=-1 The increasing demand for reliable predictions from machine learning systems has driven the development of statistical frameworks for \emph{distribution-free risk control} \citep{angelopoulos2021learn, bates2021distribution}. Such frameworks rely on data-driven inference, leveraging representative held-out data to select parameters that control an application-specific risk, \eg~a threshold value for outlier flagging. The hope is that the user can then employ the selected settings indefinitely to aid reliable decision-making. However, the common mismatch between validation and deployment conditions in machine learning systems can undermine any `quality assurance' stamp these methods derive from their static inference. Challenges such as outliers, distribution shifts and feedback loops are commonplace \citep{koh2021wilds}. In fact, \citet{van2023accurate} argue that an effective machine learning model should \emph{actively} affect the real world; distribution shift is then not merely an artifact or deployment challenge, but rather a manifestation of a successfully operating system. Hence, any decision-making parameters require \emph{continuous monitoring} during deployment, and the user should be notified when statistical reliability falters.

\looseness=-1 We address this problem by proposing a general framework that continuously monitors bounded risks in evolving data streams in real time and raises an alarm when desired risk levels are in danger of being violated. Since alarm signals may trigger costly preventive measures, \eg~a production line stop in manufacturing or default loan denial in credit underwriting, it is crucial that false alarms are not raised too often, and our approach controls the false alarm rate. We deliberately make minimal assumptions on the deployment setting or the nature of encountered data, so that the framework remains applicable under arbitrary or \emph{unknown shifts}. To achieve our goal we adopt the `testing by betting' paradigm \citep{ramdas2023game} and cast our monitoring task as a sequential hypothesis testing problem. Leveraging the paradigm's natural error control properties, our resulting monitoring procedure is both efficient and statistically rigorous. To summarize, our contributions include:

\begin{itemize}[leftmargin=13pt, itemsep=0pt, topsep=0pt]
    \item In \autoref{sec:method}, we motivate sequential testing as a natural approach to continuous risk monitoring, place it in the context of `testing by betting' and re-interpret the prior method of \citet{Feldman2022AchievingRC} under this lens.
    \item In \autoref{sec:theory}, we theoretically outline the statistical properties of our approach, including control over the false alarm rate, asymptotic consistency, and, under some conditions, bounds on the detection time of violations (\hyperref[thm:detection-delay-argument]{Prop.~\ref{thm:detection-delay-argument}}).
    \item In \autoref{sec:exp}, we demonstrate the efficacy of our approach against baselines for risk monitoring in outlier detection (\autoref{subsec:exp-ood}) and set prediction tasks (\autoref{subsec:exp-sets}), employing real-world datasets and different shift scenarios including natural temporal shifts.
\end{itemize}
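
As a rough illustration of the betting view mentioned above, consider the following minimal sketch. The notation used here (losses $Z_t \in [0,1]$, target risk level $\epsilon \in (0,1)$, false alarm level $\delta \in (0,1)$, and betting fractions $\lambda_t$) is assumed for illustration only and need not match the formulation in \autoref{sec:method}. To test the null hypothesis that the conditional risk never exceeds $\epsilon$, that is, $\mathbb{E}[Z_t \mid \mathcal{F}_{t-1}] \le \epsilon$ for all $t$, a bettor starts with unit wealth and updates
\begin{equation*}
    W_0 = 1, \qquad W_t = W_{t-1}\,\bigl(1 + \lambda_t\,(Z_t - \epsilon)\bigr), \qquad \lambda_t \in [0, 1/\epsilon],
\end{equation*}
where $\lambda_t$ may depend only on $Z_1, \dots, Z_{t-1}$. The constraint on $\lambda_t$ keeps the wealth nonnegative, and under the null each factor has conditional expectation at most one, so $(W_t)_{t \ge 0}$ is a nonnegative supermartingale. Ville's inequality then gives
\begin{equation*}
    \mathbb{P}\bigl(\exists\, t \ge 1 : W_t \ge 1/\delta\bigr) \le \delta,
\end{equation*}
so raising an alarm the first time $W_t \ge 1/\delta$ keeps the false alarm probability at most $\delta$ over the entire stream, without fixing a sample size in advance.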
