
\input{fig/fig_ood}

\looseness=-1We empirically validate our risk monitoring approach on two tasks, outlier detection (\autoref{subsec:exp-ood}) and set prediction (\autoref{subsec:exp-sets}). The first experiment induces shifts by mixture sampling in order to demonstrate monitoring behaviour for explicit scenarios, while the second is based on naturally occurring temporal shifts. Evaluating on diverse, real-world datasets, we find that the method ensures both timely detection of risk violations and a controlled false alarm rate. We next outline our baselines and practical design choices, and then describe each experiment in more detail (see also \autoref{app:sec-exp-design}). Our code is publicly available at \url{https://github.com/alextimans/risk-monitor}.

\paragraph{Baselines.} \looseness=-1We compare our primary monitoring approach, the wealth process $M_t(\psi)$ described by \autoref{eq:test-supermartingale}, to the following (empirical) risk tracking mechanisms:
\begin{itemize}[leftmargin=13pt, itemsep=0pt, topsep=0pt]
    \item[\emph{(i)}] An empirical estimate of the \emph{unobservable oracle} or true population risk $\gR_t(\psi)$, computed as ${\hat{\gR}_t(\psi) = \frac{1}{B_*} \sum_{b=1}^{B_*} z_{t, b}, \; z_{t, b} \sim P_t}$ for a batch draw $B_*$ of large size (\eg~$B_* = 1000$). We want $M_t(\psi)$ to emulate the monitoring behaviour of $\hat{\gR}_t(\psi)$ as closely as possible while controlling false alarms by \hyperref[thm:false-alarm]{Lemma~\ref{thm:false-alarm}}.
    
    \item[\emph{(ii)}] An empirical estimate of the running risk $\gR_r(\psi)$, accumulated over the data stream for a given time step $t$ as ${\hat{\gR}_r(\psi) = \frac{1}{t} \sum_{i=1}^{t} z_i}$. This is the risk quantity evaluated both by \cite{podkopaev2021tracking} as a tractable estimate of the running risk, and \cite{Feldman2022AchievingRC} directly as a rolling risk target (\autoref{subsec:connection-methods}). Note that since we are monitoring the more challenging instantaneous risk $\gR_t(\psi)$, the estimator $\hat{\gR}_r(\psi)$ is nominally void of any false alarm guarantees.
    
    \item[\emph{(iii)}] The summation wealth process illustrated in \autoref{sec:method}, given by ${M^{SUM}_{t}(\psi) = \sum_{i=1}^{t} \lambda_i (z_i - \epsilon)}$. This process retains the same false alarm guarantees as $M_t(\psi)$, but tends to be less adaptive as evidence is accumulated additively.
    
    \item[\emph{(iv)}] The \emph{predictably-mixed Empirical-Bernstein} wealth process from \cite{waudby2024estimating}, given by ${M^{EB}_{t}(\psi) = \prod_{i=1}^{t} \exp\{ \lambda_i \, (z_i - \epsilon) - v_i \, \rho(\lambda_i) \}}$ where $v_i = 4\,(z_i - \hat{\mu}_{i-1})^2$, $\rho(\lambda_i) = 1/4\,(- \log(1 - \lambda_i) - \lambda_i)$, and we use the \emph{predictable plug-in} betting rate $\lambda^{EB}_i = \min\left\{\sqrt{\frac{2 \, \log(2/\delta)}{\hat{\sigma}^2_{i-1} \, i \, \log(1 + i)}}, \frac{1}{2}\right\}$. A similar method is also derived by \cite{podkopaev2021tracking} to estimate confidence bounds on $\gR_r(\psi)$ in their problem setting, but we employ its direct form as a sequential test.
\end{itemize}
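
As a rough illustration of baselines \emph{(ii)} and \emph{(iv)}, the following Python sketch computes the running risk and the logarithm of the Empirical-Bernstein wealth process for a stream of losses in $[0,1]$. The function names, the argument \texttt{c} for the cap $1/2$, and in particular the regularised starting values $\hat{\mu}_0 = 1/2$ and $\hat{\sigma}^2_0 = 1/4$ are assumptions made for this sketch and need not match the released code.

```python
import math


def running_risk(losses):
    """Running empirical risk (1/t) * sum_{i<=t} z_i for every prefix t."""
    out, total = [], 0.0
    for t, z in enumerate(losses, start=1):
        total += z
        out.append(total / t)
    return out


def eb_log_wealth(losses, eps, delta, c=0.5):
    """Log of the Empirical-Bernstein wealth process, one value per step."""
    log_m = 0.0
    sum_z = 0.5    # assumed pseudo-observation, gives mu_hat_0 = 1/2
    sum_sq = 0.25  # assumed pseudo-observation, gives sigma_hat^2_0 = 1/4
    history = []
    for i, z in enumerate(losses, start=1):
        mu_prev = sum_z / i    # mu_hat_{i-1}
        var_prev = sum_sq / i  # sigma_hat^2_{i-1}
        # Predictable plug-in betting rate, capped at c = 1/2.
        lam = min(
            math.sqrt(2 * math.log(2 / delta)
                      / (var_prev * i * math.log(1 + i))),
            c,
        )
        v = 4 * (z - mu_prev) ** 2
        rho = 0.25 * (-math.log(1 - lam) - lam)
        log_m += lam * (z - eps) - v * rho
        history.append(log_m)
        # Update the running statistics only after the bet was placed.
        sum_z += z
        mu_new = sum_z / (i + 1)
        sum_sq += (z - mu_new) ** 2
    return history
```

Working in log space avoids overflow for long streams. The betting rate at step $i$ uses only observations up to $i-1$, which keeps it predictable.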

\paragraph{Choice of betting rate.} \looseness=-1We follow the growth rate optimality (GRO) condition outlined in \hyperref[thm:gro]{Definition~\ref{thm:gro}} to guide our choice of betting rate. Whereas selecting $\lambda_t$ based on direct wealth maximization is possible, the approach can be computationally expensive to re-evaluate for every candidate $\psi$ and time step $t$. Instead, we leverage a suggested approximation by \cite{waudby2024estimating}, yielding the closed-form expression 
\begin{equation*}
    \lambda^{AGR}_t = \max \left\{ 0, \min \left\{ \frac{\hat{\mu}_{t-1} - \epsilon}{\hat{\sigma}^2_{t-1} + (\hat{\mu}_{t-1} - \epsilon)^2}, \frac{1/2}{\epsilon} \right\} \right\},
\end{equation*}
where $\hat{\mu}_{t-1}$ and $\hat{\sigma}^2_{t-1}$ denote the estimated running mean and variance over $\{z_i\}_{i=1}^{t-1}$. Intuitively, the betting rate increases when the running mean is far above $\epsilon$, and is further amplified by a small variance. $\lambda^{AGR}_t$ is \emph{approximately} GRO \citep{shekhar2023near} and performs similarly to direct maximization in practice. A range of other suitable bets is discussed in \cite{waudby2024estimating}, and we briefly touch upon this in \autoref{app:sec-exp-design}.
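
The closed-form rate can be evaluated in a few lines. The sketch below is a minimal Python illustration. The function name, the use of the plain (unregularised) sample mean and variance, and the choice to bet zero when no data is available or when the denominator vanishes are assumptions of this sketch.

```python
def agr_betting_rate(past_losses, eps, c=0.5):
    """Approximate-GRO betting rate from the losses z_1, ..., z_{t-1}."""
    if not past_losses:
        return 0.0  # assumption: place no bet before any data is seen
    n = len(past_losses)
    mu = sum(past_losses) / n
    var = sum((z - mu) ** 2 for z in past_losses) / n
    gap = mu - eps
    denom = var + gap ** 2
    if denom == 0.0:
        return 0.0  # assumption: mean equals eps exactly with zero variance
    # Clip to [0, c / eps] as in the closed-form expression.
    return max(0.0, min(gap / denom, c / eps))
```

For instance, a history of all-zero losses with $\epsilon = 0.1$ gives a negative ratio and hence a rate of zero, while a history concentrated slightly above $\epsilon$ with negligible variance hits the cap $\frac{1/2}{\epsilon}$.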

\paragraph{Batching, sliding window and burn-in.} \looseness=-1Instead of a data stream where samples arrive individually at every time step, we may also consider the arrival of small batches of size $B \ll B_*$, \ie~we sample $\{ (\vx_{t,b}, \vy_{t,b})\}_{b=1}^{B} \sim P_t$. The batch-wise evidence at every time step can be easily aggregated by, for instance, averaging, which tends to both reduce the variance of the tracking process and improve the detection delay $\tau(\psi) - \tau_*(\psi)$ with respect to the true risk even for small batches ($B=10$). Similarly, delays can be reduced by enhancing the adaptivity of any tracker via a sliding window of size $S$, wherein only the most recent observations for time steps $i \in [t-S, t]$ are considered. Intuitively, the observational history is truncated by discarding past information deemed irrelevant for the current shift environment. This renders the tracking process more reactive (\eg, via the betting rate parameters $\hat{\mu}_{t-1}, \hat{\sigma}^2_{t-1}$) but also increases sensitivity to the retained samples, heightening the chance of false alarms if the resulting evidence is misleading. The choice of $B$ and $S$ can sometimes be delicate, and results for different combinations are provided in \autoref{app:sec-exp-res}. Finally, we introduce an initial number of \emph{burn-in} time steps $t_{burn} = \left\lfloor 100/B \right\rfloor$ during which any risk tracker (aside from the true risk) does not test for risk violation but merely accumulates samples, in order to stabilize any running quantities such as $\hat{\gR}_r(\psi)$. 
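
The three design choices above can be sketched together in Python. The class and function names are hypothetical, and the window here retains the most recent \texttt{window} batch averages, which is a simplification of the index range $[t-S, t]$.

```python
from collections import deque


def burn_in_steps(batch_size, budget=100):
    """t_burn = floor(100 / B): steps that only accumulate samples."""
    return budget // batch_size


class WindowedTracker:
    """Running mean/variance of batch-averaged losses over a sliding window."""

    def __init__(self, window=None):
        # window=None keeps the full history (no truncation).
        self.values = deque(maxlen=window)

    def update(self, batch_losses):
        # Aggregate the batch evidence at this time step by averaging.
        self.values.append(sum(batch_losses) / len(batch_losses))

    def mean(self):
        return sum(self.values) / len(self.values)

    def variance(self):
        m = self.mean()
        return sum((v - m) ** 2 for v in self.values) / len(self.values)
```

With $B = 10$ this gives ten burn-in steps. A bounded \texttt{deque} silently drops the oldest batch average once the window is full, which is what makes the statistics more reactive and also more sensitive to the retained samples.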

\input{fig/fig_set}

\subsection{Monitoring the Total Error Rate for Outlier Detection}
\label{subsec:exp-ood}

\looseness=-1We first consider the task of outlier detection, and instantiate the threshold predictor from \autoref{eq:def-threshold-pred} as ${\hat{f}_{\psi}(\rvx) = \mathbbm{1}[\texttt{out}(\rvx) \geq \psi]}$, where $\texttt{out}: \gS \rightarrow [0,1]$ maps the predictor's output to a bounded outlier score and $\mathbbm{1}[\cdot]$ is the indicator function. When $\texttt{out}(\rvx) \geq \psi$ evaluates true we declare the sample an outlier. Given a classification setting, we define $\texttt{out}$ as the normalized entropy of the base model's predictive distribution $\hat{p}(\rvy \mid \rvx)$\footnote{This is merely one possible choice, and can be easily swapped for other scoring mechanisms satisfying the required bounds.}. The target risk to monitor is given by the \emph{total error rate}, accounting for both cases of inlier (false positives, FP) and outlier misclassification (false negatives, FN) via the loss variable
\begin{equation*}
\label{eq:ood-ter}
    \rz_t = 
    \begin{cases}
        1, & \text{if } \texttt{out}(\rvx_t) \geq \psi \; \text{ and }\; (\rvx_t, \rvy_t) \sim P_{in}, \hfill \text{ (FP)} \\
        1, & \text{if } \texttt{out}(\rvx_t) < \psi \; \text{ and }\; (\rvx_t, \rvy_t) \sim P_{out}, \; \text{ (FN)} \\
        0, & \text{otherwise.}
    \end{cases}
\end{equation*}
$P_{in}$ and $P_{out}$ denote inlier and outlier distributions, and the shifting stream is characterized by a time-dependent outlier probability $\pi_{t}^{out}$ such that $(\vx_t, \vy_t) \sim (1 - \pi_{t}^{out})\,P_{in} + \pi_{t}^{out}\,P_{out}$ for $t \in \gT$ is generated by mixture sampling. We consider three distinct shift settings: \emph{(i)} an \emph{i.i.d.} stream where trivially $\pi_{t}^{out} = 0$ across all time steps; \emph{(ii)} an immediate stark outlier shift where $\pi_{t}^{out} = 1$ early on; and \emph{(iii)} a stepwise shift with $\pi_{t}^{out} \in \{0, 0.05, 0.1, \dots, 1 \}$ increasing every $t_{out}$ time steps. Risk parameters are set to common values $\epsilon = 0.1, \delta = 0.1$, and we simulate for $T = 1500$ steps. $P_{in}$ and $P_{out}$ are given by CIFAR-10 \citep{krizhevsky2009learning} and SVHN \citep{netzer2011reading} respectively, with a base classifier (ResNet-50) trained on CIFAR-10.
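
The scoring function, loss variable and stepwise mixture schedule can be sketched as follows. All function names are hypothetical, and the schedule assumes that $\pi_{t}^{out}$ starts at zero and grows by $0.05$ after every $t_{out}$ steps until it reaches one.

```python
import math
import random


def normalized_entropy(probs):
    """Entropy of a predictive distribution, scaled to [0, 1]."""
    h = -sum(p * math.log(p) for p in probs if p > 0)
    return h / math.log(len(probs))


def ter_loss(score, psi, is_outlier):
    """Total error rate loss: 1 for a false positive or a false negative."""
    flagged = score >= psi
    return int(flagged != is_outlier)


def stepwise_outlier_prob(t, t_out, step=0.05):
    """Assumed schedule: pi_t grows by `step` every t_out steps, capped at 1."""
    return min(1.0, step * (t // t_out))


def sample_is_outlier(pi_out, rng):
    """Mixture sampling: draw from P_out with probability pi_out."""
    return rng.random() < pi_out
```

A sample is then drawn from $P_{out}$ whenever \texttt{sample\_is\_outlier} returns true and from $P_{in}$ otherwise, and \texttt{ter\_loss} supplies the corresponding $\rz_t$ for a given threshold $\psi$.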

\looseness=-1Our results in \autoref{fig:ood-exp} for the stepwise shift show that as the shift intensity increases, so does the number of risk-violating thresholds, leading to a gradual shrinkage of the $\psi$-CS towards zero. Among risk trackers, the running risk $\hat{\gR}_r(\psi)$ emulates the true risk well but tends to misinterpret evidence, resulting in an undesirable number of false alarms. In contrast, all martingale-based trackers uphold the guarantee, at the cost of increased detection delays. Among them, the monitoring behaviour of the wealth process $M_t(\psi)$ most closely aligns with the true risk, striking a good trade-off. Results for other shift settings can be found in \autoref{app:sec-exp-res}, where as anticipated \emph{(i)} for the \emph{i.i.d.} case most thresholds remain valid and the $\psi$-CS stabilizes over the full data stream; and \emph{(ii)} for the immediate shift all thresholds are rejected as soon as possible, correctly identifying $\hat{f}_{\psi}$ as highly unreliable.

\subsection{Monitoring the Miscoverage Rate for Set Prediction}
\label{subsec:exp-sets}

\looseness=-1Next we consider set prediction tasks on data subject to \emph{natural temporal shifts}, for both the classification and regression settings. 

\input{fig/fig_set_naval}

\paragraph{Functional Map of the World.}\looseness=-1For classification, we instantiate $\hat{f}_{\psi}$ as a set predictor of the form 
\begin{equation*}
\label{eq:set-pred}
    \hat{f}_{\psi}(\rvx) = \{ \vy \in \gY: \hat{p}(\rvy = \vy \mid \rvx) \geq \psi \},
\end{equation*}
and the base classifier once more returns a predictive distribution $\hat{p}(\rvy \mid \rvx)$ used to determine class inclusion in the set. A natural risk to monitor here is the \emph{miscoverage rate} with loss variable $\rz_t = \mathbbm{1}[\vy_t \notin \hat{f}_{\psi}(\rvx_t)]$, where $\vy_t$ denotes the true label. We consider the \emph{Functional Map of the World} dataset (FMoW) \citep{christie2018functional}, a large-scale satellite image dataset on building and land use over 16 years, and employ a time-dependent partitioning proposed by \cite{yao2022wild}. Therein a natural shift occurs as the \emph{same} satellite image locations capture land use changes over time. The classifier (DenseNet-121) is trained on the first 11 years, and we increase the test stream frequency by sampling chronologically from the final five years every 365 time steps (simulating daily observations). We again set $\epsilon = 0.1, \delta = 0.1$ and run for $T = 2000$ steps. 
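
A minimal Python sketch of the thresholded set predictor and its miscoverage loss is given below. The function names are hypothetical, and class labels are assumed to be the integer indices of the probability vector.

```python
def prediction_set(probs, psi):
    """Classes whose predicted probability is at least the threshold psi."""
    return {label for label, p in enumerate(probs) if p >= psi}


def miscoverage_loss(pred_set, true_label):
    """1 if the true label is missing from the prediction set, else 0."""
    return int(true_label not in pred_set)
```

Raising $\psi$ can only remove classes from the set, so for a fixed sample the loss is non-decreasing in $\psi$ under this construction.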

\looseness=-1Our results in \autoref{fig:set-exp} lead to conclusions similar to those in \autoref{subsec:exp-ood}, that is, the proposed wealth process $M_t(\psi)$ produces the lowest detection delays among all risk monitoring processes with false alarm control, while the running risk prematurely rejects some threshold candidates. Interestingly, the natural temporal shift induces a non-monotonic risk profile, wherein miscoverage for the second year slightly drops, but thereafter starkly increases. We elaborate on the connection between risk profiles and threshold behaviour in \autoref{app:sec-exp-design}, and provide complete results in \autoref{tab:app-fmow-natural}.

\paragraph{Naval propulsion system.}\looseness=-1For regression, the set predictor takes the interval form ${\hat{f}_{\psi}(\rvx) = [\hat{f}(\rvx) - \psi, \, \hat{f}(\rvx) + \psi]}$, with $\hat{f}(\rvx)$ returning point estimates. The target risk remains the miscoverage rate, and we consider predictive maintenance data on naval gas turbine behaviour \citep{cipollini2018condition}. This tabular time series consists of $\sim 12 \, 000$ recordings for various turbine system parameters, and an associated turbine compressor degradation coefficient denoting the compressor's health. Over time this degradation coefficient steadily increases, denoting a gradual equipment decay. We train a Random Forest regressor on the initial `healthy' compressor state (enriched via jittered resampling) which, as expected, fails to extrapolate as the degradation worsens, resulting in decreased performance in line with a temporal shift. Once more we have $\epsilon = 0.1, \delta = 0.1$ and run our monitoring process for the full time series.
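
For the interval predictor the miscoverage loss reduces to a comparison of the absolute residual with $\psi$, as the short Python sketch below illustrates. The function name is hypothetical.

```python
def interval_miscoverage_loss(y_true, y_pred, psi):
    """1 if y_true lies outside [y_pred - psi, y_pred + psi], else 0."""
    return int(abs(y_true - y_pred) > psi)
```

In contrast to the classification case, a larger $\psi$ widens the interval, so for a fixed sample the loss is non-increasing in $\psi$.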

\looseness=-1Our results in \autoref{fig:set-exp-naval}, summarized in \autoref{tab:app-uci-natural}, are consistent with the other experiments. Specifically, we observe \emph{(i)} false alarms incurred by the running risk, in particular for more adaptive tracking windows (smaller $S$); and \emph{(ii)} the lowest detection delays for the wealth process $M_t(\psi)$ among all trackers with false alarm guarantees. Furthermore, the gap between running risk and wealth process remains fairly narrow under most realistic settings (\ie, small $B$ and large $S$). Visually, the $\psi$-CS (\autoref{eq:psi-cs}) stabilizes during the initial healthy state, but consistently shrinks as turbine compressor degradation and thus distributional shift worsens. The visualized threshold, being very small, displays sensitive risk behaviour even during early time steps.