
\looseness=-1We next outline our approach to risk monitoring leveraging sequential hypothesis testing. We motivate how such a testing framework naturally arises by recasting our stream setting as a forecasting `game' between two agents---the forecaster and nature---and formalizing the collected evidence as an error accumulation process. The procedure is then placed in the context of sequential `testing by betting' \citep{ramdas2023game}, thereby enjoying the practicality as well as the rigour of the framework. Finally, we theoretically connect our approach to related methods in \autoref{subsec:connection-methods}.

\paragraph{A sequential forecasting game.} \looseness=-1Consider a game between two agents, the \emph{forecaster} and \emph{nature} (\ie~the environment). The forecaster provides a guess $\pi_t$ for the true risk at time $t$ given their knowledge of the observational history, formally encapsulated in the filtration $\mathcal{F}_{t-1} = \sigma(\{(z_1, \pi_1), (z_2, \pi_2), \dots, (z_{t-1}, \pi_{t-1})\})$ (refer to \autoref{app:math-defn.-and-terminology} for technical definitions). Should the forecaster desire to minimize the mean squared prediction error ${\mathbb{E}_{P_t}[(\rz_t -  \pi_t)^{2}  \ \vert \ \mathcal{F}_{t-1}]}$, their best guess is given by $\pi_t = \mathbb{E}_{P_t}[\rz_{t} \ \vert \ \mathcal{F}_{t-1}]$. Nature then reveals the value of $\rz_t$, leading to an observable \emph{discrepancy} $\delta_t = z_t - \pi_t$ representing the incurred forecasting error. As the game is repeated, a sequence of discrepancies $(\delta_t)_{t \in \mathcal{T}}$ is iteratively built. Crucially, if the forecaster continues to make their best guess at every step, the resulting discrepancy process forms a martingale difference sequence, \ie, $\mathbb{E}_{P_t}[\updelta_{t} \ \vert \ \mathcal{F}_{t-1}] = 0$, and hence $\frac{1}{t}\sum_{i=1}^{t}\delta_i \rightarrow 0$ as $t \rightarrow \infty$. Thus, under the forecaster's best strategy, asymptotic alignment between forecasts and actual outcomes is ensured, and systematic deviations in the discrepancies (\ie~error accumulation) can serve as evidence, or a testing signal, for whether such alignment holds.
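
To make the optimality of the conditional mean explicit, the following standard decomposition may be helpful. It is a sketch only, assuming $\rz_t$ is square-integrable and writing $\mu_t = \mathbb{E}_{P_t}[\rz_t \ \vert \ \mathcal{F}_{t-1}]$ (a shorthand introduced here for illustration). For any $\mathcal{F}_{t-1}$-measurable guess $\pi_t$,
\begin{align*}
    \mathbb{E}_{P_t}[(\rz_t - \pi_t)^{2} \ \vert \ \mathcal{F}_{t-1}]
    &= \mathbb{E}_{P_t}[(\rz_t - \mu_t)^{2} \ \vert \ \mathcal{F}_{t-1}] + 2\,(\mu_t - \pi_t)\,\mathbb{E}_{P_t}[\rz_t - \mu_t \ \vert \ \mathcal{F}_{t-1}] + (\mu_t - \pi_t)^{2} \\
    &= \mathbb{E}_{P_t}[(\rz_t - \mu_t)^{2} \ \vert \ \mathcal{F}_{t-1}] + (\mu_t - \pi_t)^{2},
\end{align*}
since the cross term vanishes by the definition of $\mu_t$. The first term does not depend on $\pi_t$, so the error is minimized at $\pi_t = \mu_t$, in which case $\mathbb{E}_{P_t}[\updelta_t \ \vert \ \mathcal{F}_{t-1}] = \mu_t - \pi_t = 0$.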

\looseness=-1Adapting the game to our problem setting, assume the forecaster's guess is upper-bounded as $\mathbb{E}_{P_t}\left[\rz_t \ \vert \ \mathcal{F}_{t-1}\right] \leq \epsilon$. In general, nature has no obvious incentive to align its realizations of $\rz_t$ with the forecaster. However, in our setting the outcomes are directly affected by the choice of threshold $\psi$ since $\rz_t = \ell(\hat{f}_{\psi}(\rvx_t), \rvy_t)$, rendering the associated discrepancy process useful for testing. For each candidate $\psi \in \gPsi$, a formal hypothesis test on alignment at risk level $\epsilon$ can be formulated as
\begin{align}
\label{eq:hypotheses-formulation}
\begin{split}
    H_{0}(\psi) &: \mathbb{E}_{P_t}\left[\rz_{t} \ \vert \ \mathcal{F}_{t-1}\right]\leq\epsilon \; \forall t \in \gT \,\quad \text{(risk controlled)} \\
    \quad H_{1}(\psi) &: \exists t \in \gT: \mathbb{E}_{P_t}\left[\rz_{t} \ \vert \ \mathcal{F}_{t-1}\right] > \epsilon, \quad \text{(risk violated)}
\end{split}
\end{align}
and our game suggests that the threshold's discrepancy process $(\delta_{t})_{t \in \gT}$ can provide the necessary testing evidence. 

\paragraph{Example test statistic.} \looseness=-1Given the sequential test in \autoref{eq:hypotheses-formulation}, how should the discrepancy sequence be leveraged to construct a test statistic? A straightforward choice is the cumulative process $M_t(\psi) = \sum_{i=1}^{t} \lambda_i \cdot \updelta_i = \sum_{i=1}^{t} \lambda_i (\rz_i - \epsilon)$, where $\lambda_t$ denotes a non-negative weight associated with the `trust' placed in the aggregated evidence at time $t$, dictating how $M_t(\psi)$ evolves. Intuitively, if $H_0(\psi)$ is true and the risk is indeed bounded by $\epsilon$, then the forecaster's guesses should be well aligned and the discrepancies should exhibit little systematic effect. In that case, $M_t(\psi)$ forms a \emph{supermartingale}, meaning that it is not expected to increase since $\mathbb{E}_{P_t} [\updelta_t \mid \mathcal{F}_{t-1}] \leq 0$. On the other hand, consistent evidence indicating the risk's growth beyond $\epsilon$ will accumulate and drive the growth of $M_t(\psi)$, signaling evidence for rejection in favour of $H_1(\psi)$. Thus the cumulative process provides a viable test statistic for \autoref{eq:hypotheses-formulation}, and we further expand on this approach in \autoref{app:sum-process}.
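
As a brief sketch of why the supermartingale property holds, assume in addition to $\lambda_t \geq 0$ that $\lambda_t$ is $\mathcal{F}_{t-1}$-measurable, \ie~it is chosen before $\rz_t$ is revealed (an assumption we state here for the purpose of the illustration). Then $\lambda_t$ can be pulled out of the conditional expectation and
\begin{align*}
    \mathbb{E}_{P_t}\left[M_t(\psi) \ \vert \ \mathcal{F}_{t-1}\right]
    = M_{t-1}(\psi) + \lambda_t \left(\mathbb{E}_{P_t}\left[\rz_t \ \vert \ \mathcal{F}_{t-1}\right] - \epsilon\right)
    \leq M_{t-1}(\psi) \quad \text{under } H_0(\psi),
\end{align*}
whereas a conditional risk exceeding $\epsilon$ at some step makes the increment strictly positive in expectation whenever $\lambda_t > 0$.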

\paragraph{Test supermartingales and testing by betting.} \looseness=-1While the aforementioned summation statistic offers a valid testing procedure, it is not necessarily \emph{efficient} in the sense of optimally accumulating evidence. That is, we want to accumulate the necessary evidence as fast as possible should a risk violation occur. To that end, a rich body of literature on sequential testing through the lens of `testing by betting' can be leveraged \citep{ramdas2023game}. Specifically, rather than via summation we may consider the multiplicative accumulation of discrepancies as
\begin{align}
\label{eq:test-supermartingale}
    M_t\left(\psi\right) = \prod_{i=1}^{t}\left(1 + \lambda_{i}\cdot \updelta_{i}\right) = \prod_{i=1}^{t}\left(1 + \lambda_{i}\left(\rz_i - \epsilon\right)\right),
\end{align}
yielding a universal representation of a \emph{test supermartingale} if we ensure $M_0 = 1$ and $(\lambda_t)_{t \in \mathcal{T}}$ to be a \emph{predictable} process based only on past observations \citep{ramdas2024hypothesis}. That is, $\lambda_t$ may only depend on $\{z_i\}_{i=1}^{t-1}$ (and is thus measurable w.r.t. $\mathcal{F}_{t-1}$). A game-theoretic interpretation can be given to the sequential test and each component in \autoref{eq:test-supermartingale}\footnote{The evidence collection process $M_{t}(\psi)$ can be interchangeably referred to as a \emph{test (super)martingale} by its mathematical properties, \emph{wealth process} by its betting interpretation, or \emph{E-process} in the context of the sequential testing literature.}. The forecaster is actively betting against the null hypothesis starting from an initial wealth of $M_0 = 1$, and $M_t(\psi)$ describes the \emph{wealth process} at every subsequent betting round $t$. The betting rate $\lambda_t$ denotes the proportion of wealth gambled at each step, and $(z_t - \epsilon)$ the resulting pay-off once nature reveals $\rz_t$. Should $H_0(\psi)$ hold, then no betting strategy is expected to systematically increase wealth. On the other hand, a betting strategy resulting in meaningful wealth accumulation points towards evidence against the null. A rejection threshold can be employed to reach a final testing decision with \emph{stopping time} $\tau(\psi) \in \mathcal{T}$, denoting the time step at which $H_0(\psi)$ has been ruled out by the wealth process.
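
The same one-step argument can be sketched for the multiplicative process. As illustrative assumptions not stated above, suppose the loss is bounded as $\rz_t \in [0, 1]$ and the betting rate is restricted to $\lambda_t \in [0, 1/\epsilon)$. Every factor then satisfies $1 + \lambda_t(\rz_t - \epsilon) \geq 1 - \lambda_t \epsilon > 0$, so the wealth remains strictly positive, and by predictability of $\lambda_t$
\begin{align*}
    \mathbb{E}_{P_t}\left[M_t(\psi) \ \vert \ \mathcal{F}_{t-1}\right]
    = M_{t-1}(\psi) \left(1 + \lambda_t \left(\mathbb{E}_{P_t}\left[\rz_t \ \vert \ \mathcal{F}_{t-1}\right] - \epsilon\right)\right)
    \leq M_{t-1}(\psi) \quad \text{under } H_0(\psi).
\end{align*}
Non-negativity together with this inequality and $M_0 = 1$ is what qualifies $M_t(\psi)$ as a test supermartingale under these assumptions.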

\paragraph{Constructing threshold confidence sets.} \looseness=-1Since the risk associated with every threshold needs to be monitored simultaneously, we instantiate a number of wealth processes $M_t(\psi)$ in parallel, one for each candidate $\psi \in \gPsi$. Their joint behaviour can be encapsulated in a \emph{confidence set} ($\psi$-CS) of valid thresholds at every time step $t$, constructed as
\begin{equation}
\label{eq:psi-cs}
    C_{t}^{\psi} = \{\psi \in \gPsi \ : \ M_{t}\left(\psi\right) < 1/\delta\}.
\end{equation}
That is, using the predefined risk control parameters $\epsilon, \delta$, the confidence set $C_{t}^{\psi}$ denotes the set of thresholds at time $t$ for which $H_0(\psi)$ has not yet been rejected. It is, in effect, the equivalent of the threshold set $\hat{\gPsi}_t \subseteq \gPsi$ described in \autoref{sec:background} using the particular rejection threshold $1/\delta$. Crucially, by leveraging the rejection threshold $1/\delta$ and test martingale properties of $M_t(\psi)$, any threshold that does \emph{not} violate the risk level $\epsilon$ at time $t$ is guaranteed to be included in $C_{t}^{\psi}$ with high probability, \ie, it holds that $\mathbb{P}_{H_0}(\forall t \in \mathcal{T} \ : \ M_{t}(\psi) < 1/\delta) \geq 1-\delta$. We interpret this Type-I error control property as a \emph{false alarm guarantee} on erroneous rejection, and elaborate upon it in \autoref{sec:theory}. In addition, the size of the $\psi$-CS can be interpreted as an indicator for the stream's shift intensity, and thus the underlying model's deployment reliability. A constant set size indicates that temporally stable threshold choices are available, whereas a shrinkage of $C_{t}^{\psi}$ towards the empty set implies that all thresholds eventually signal risk violation, necessitating a more substantial model update using the observational history. Since we are concerned with risk \emph{monitoring} only, we leave the discussion on model updating, or \emph{safe adaptation}, for future work.
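
The false alarm guarantee can be traced back to Ville's inequality, a standard result for non-negative supermartingales. As a sketch, assuming $M_t(\psi)$ is non-negative with $M_0 = 1$ and a supermartingale under $H_0(\psi)$,
\begin{align*}
    \mathbb{P}_{H_0}\left(\exists t \in \mathcal{T} \ : \ M_{t}(\psi) \geq 1/\delta\right) \leq \delta \cdot \mathbb{E}\left[M_0\right] = \delta,
\end{align*}
and taking the complement yields the time-uniform statement. For a purely illustrative calculation, take the hypothetical values $\epsilon = 0.1$, $\delta = 0.05$ (rejection threshold $1/\delta = 20$), a constant betting rate $\lambda_t = 0.5$, and losses in $[0, 1]$. If a threshold $\psi$ incurs the maximal loss $z_t = 1$ at every step, each round multiplies the wealth by $1 + 0.5\,(1 - 0.1) = 1.45$, so that $M_8(\psi) = 1.45^{8} \approx 19.5 < 20$ and $M_9(\psi) = 1.45^{9} \approx 28.3 \geq 20$; the threshold would leave $C_{t}^{\psi}$ at $t = 9$. If instead $z_t = 0$ throughout, each factor equals $1 + 0.5\,(0 - 0.1) = 0.95$, the wealth decays, and $\psi$ is retained.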

\paragraph{Practical considerations.} \looseness=-1An important distinction to stress is that any false alarm guarantee holds \emph{across time} for every threshold, and not \emph{across thresholds} at every time step. Thus no guarantees can be given on an adaptive strategy to select a particular $\hat{\psi}_t \in C_{t}^{\psi}$ at every step, unless multiple testing corrections (which we do not consider here) are introduced to control for the multi-stream setting, \eg~drawing inspiration from \cite{xu2024online, dandapanthula2025multiple}. Empirically, one may adopt strategies such as selecting a stable threshold that persists over extended time horizons or a threshold that maximizes significant results (\eg~the minimum value). These choices also relate to the \emph{risk profile} of $\gR_t(\psi)$, which dictates if a `trivial' stable solution (that may be very conservative) is available, and facilitates the interpretability of the obtained $\psi$-CS. A preferable risk profile will behave monotonically both across time (\eg, $\lim_{t \rightarrow 0} \gR_t(\psi) = 0 \text{ and } \lim_{t \rightarrow T} \gR_t(\psi) = 1$) and across thresholds (\eg, $\lim_{\psi \rightarrow 0} \gR_t(\psi) = 0 \text{ and } \lim_{\psi \rightarrow 1} \gR_t(\psi) = 1$). However, we do not assume such conditions and our experiments in \autoref{sec:exp} address non-monotonic behaviour in either argument.

\subsection{Relation to Other Approaches} 
\label{subsec:connection-methods}
\looseness=-1We next draw connections to other notions of risk control in the literature. Most notably, we leverage our formulation in terms of discrepancy processes to provide a novel interpretation of rolling risk control \citep{Feldman2022AchievingRC} as an implicit, adaptive form of sequential testing with asymptotic guarantees (as opposed to finite-sample). We then contrast our shifting stream setting with the simpler \emph{i.i.d.} case.

\paragraph{Sequential testing and rolling risk control.} \looseness=-1Proposed by \cite{Feldman2022AchievingRC} as an extension of \cite{Gibbs2021AdaptiveCI} to bounded risks beyond the miscoverage rate, \emph{rolling risk control} (RRC) aims to track the running estimate $\frac{1}{t} \sum_{i=1}^{t} z_i$ and ensure its asymptotic adherence to the risk level $\epsilon$ via the update rule $\psi_t = \psi_0 + \sum_{i=1}^{t-1} \gamma \, (z_i - \epsilon)$, where $\gamma>0$ denotes a step size. The `calibration parameter' $\psi$ governs the behaviour of their set predictor, rendering it an instantiation of our threshold model $\hat{f}_{\psi}$ (see also \autoref{subsec:exp-sets}). Procedurally, the model is initialized with value $\psi_0$ and RRC incrementally updates the parameter at every time step following the rule. Leveraging our discrepancy process interpretation, we can directly observe that RRC accumulates evidence via discrepancies $\delta_t = z_t - \epsilon$ over time, and is mathematically analogous to the summation process used as an example in \autoref{sec:method}. More formally, we can write the process recursively as $\psi_{t} = \psi_{t-1} + \gamma \, \left(\rz_{t-1} - \epsilon\right)$. Under the null (\autoref{eq:hypotheses-formulation}) it follows that $\mathbb{E}\left[\psi_{t} \ \vert \ \mathcal{F}_{t-1}\right] \leq \psi_{t-1}$, indicating a risk-controlling parameter, whereas under the alternative $\mathbb{E}[\rz_t \ \vert \ \mathcal{F}_{t-1}] > \epsilon$ at some time step, and $\psi_t$ then grows in expectation, accumulating evidence against the null. The `testing by betting' interpretation helps clarify the key distinction to our approach, namely how the designed wealth process is subsequently utilized. Whereas we take a testing decision on the basis of a rejection threshold, RRC does not enforce such a stopping rule but re-invests the wealth in an update step to dynamically adjust prediction set sizes.
While offering a convenient step towards model adaptation, the rule is tied to the explicit monotonicity assumption underlying RRC, wherein a larger $\psi_t$ reduces the risk by enlarging the prediction set and vice versa. Such monotonic behaviour in the thresholds is desirable, but not always available.
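
The asymptotic nature of the RRC guarantee can be made visible by unrolling the update rule. The following is a sketch under the assumption that the parameter remains in a bounded range, $\psi_t \in [\psi_{\min}, \psi_{\max}]$ for all $t$ (notation introduced here for illustration only). Rearranging $\psi_{t+1} = \psi_0 + \gamma \sum_{i=1}^{t} (z_i - \epsilon)$ gives
\begin{align*}
    \frac{1}{t}\sum_{i=1}^{t} z_i - \epsilon = \frac{\psi_{t+1} - \psi_0}{\gamma \, t},
    \qquad \text{and hence} \qquad
    \left\vert \frac{1}{t}\sum_{i=1}^{t} z_i - \epsilon \right\vert \leq \frac{\psi_{\max} - \psi_{\min}}{\gamma \, t} \rightarrow 0 \ \text{ as } \ t \rightarrow \infty.
\end{align*}
Under this assumption the statement concerns the long-run average of the realized losses; it does not by itself provide a probabilistic false alarm statement at a level $\delta$ for the conditional risk at any given time step.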

\paragraph{Risk control under the i.i.d. data stream setting.} \looseness=-1In the simple case where the test stream is drawn \emph{i.i.d.} as $(\vx_t, \vy_t)_{t \in \gT} \sim P_0$ rather than from time-dependent distributions $P_t$, we obtain that $\gR_t(\psi) = \gR_0(\psi)$ is a time-\emph{independent} risk, and the independence between samples further simplifies the risk definition (we detail our argument in \autoref{app:math-iid-stream}). We may then conveniently reverse the hypothesis pair from \autoref{eq:hypotheses-formulation} to form the test
\begin{align*}
    H_{0}(\psi): \exists t \in \gT: \gR_0(\psi) > \epsilon, \; H_{1}(\psi) : \gR_0(\psi) \leq \epsilon \; \forall t \in \gT
\end{align*}
and define a reverse wealth process as $M_{t}\left(\psi\right) = \prod_{i=1}^{t}\left(1 + \lambda_{i}\left(\epsilon - \rz_i\right)\right)$. The corresponding $\psi$-CS construction is given by $C_{t}^{\psi} = \{\psi \in \gPsi \ : \ M_{t}\left(\psi\right) \geq 1/\delta\}$. Crucially, we can exploit the fact that $P_0$ is static and thus any test conclusions drawn on \emph{non-violation} of the risk (now formalized in $H_1(\psi)$) hold indefinitely in the future. In other words, $C_{t}^{\psi}$ will only grow as more thresholds are found to be safe, but never shrink. This permits leveraging the Type-I error control property of the wealth process to claim strong \emph{time-uniform} risk control guarantees of the form $\mathbb{P}\left(\forall t \in \gT \ : \ \gR_{0}\left(\psi\right) \leq \epsilon\right) \geq 1 - \delta$, moving beyond static risk control assurances. This approach has been leveraged directly by \cite{xu2024active} to extend \emph{RCPS} \citep{bates2021distribution} to the stream setting, and indirectly by \cite{adaptiveltt} to make \emph{Learn-then-Test} \citep{angelopoulos2021learn} more adaptive. Naturally, such forward-looking assurances only hold for the setting of a \emph{static} distribution $P_0$, and are not applicable in our challenging shift setting.
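
A short sketch of why the reverse wealth process controls the error of wrongly declaring a threshold safe may be useful. As illustrative assumptions, suppose that $\lambda_t$ is predictable, that $\rz_t \in [0, 1]$ with $\lambda_t \in [0, 1/(1-\epsilon))$ so that each factor satisfies $1 + \lambda_t(\epsilon - \rz_t) \geq 1 - \lambda_t(1 - \epsilon) > 0$, and that in the i.i.d. case $\mathbb{E}_{P_0}\left[\rz_t \ \vert \ \mathcal{F}_{t-1}\right] = \gR_0(\psi)$. Under the reversed null $\gR_0(\psi) > \epsilon$ we then have
\begin{align*}
    \mathbb{E}_{P_0}\left[1 + \lambda_t\left(\epsilon - \rz_t\right) \ \vert \ \mathcal{F}_{t-1}\right]
    = 1 - \lambda_t\left(\gR_0(\psi) - \epsilon\right) \leq 1,
\end{align*}
so $M_t(\psi)$ is again a non-negative supermartingale with $M_0 = 1$, and Ville's inequality gives $\mathbb{P}\left(\exists t \in \gT \ : \ M_t(\psi) \geq 1/\delta\right) \leq \delta$. In words, under these assumptions a risk-violating threshold enters $C_{t}^{\psi}$ at any point in time with probability at most $\delta$.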
