\onecolumn

\begin{center}
    {\Large\bfseries On Continuous Monitoring of Risk Violations under Unknown Shift}\\[0.5em]
    {\Large\bfseries --- Supplementary Material ---}
\end{center}

\label{appendix}

\tableofcontents

\newpage

\section{Mathematical Details}
\label{app:math}

We provide relevant mathematical details to complement the main text, including \emph{(i)} the terminology of (super)martingales and the associated measure-theoretic objects, \emph{(ii)} a more detailed description of the summation wealth process, \emph{(iii)} insights on risk control under the \emph{i.i.d.} data stream setting, and \emph{(iv)} formal proofs of our main theoretical statements.


\subsection{Definitions and Terminology}
\label{app:math-defn.-and-terminology}

Given a sequence of random variables $\rvu^{t} = (\rvu_1, \rvu_2, \ldots, \rvu_t)$, we denote the smallest $\sigma$-field generated by $\rvu^{t}$ as $\mathcal{F}_{t} = \sigma\left(\rvu^{t}\right)$. The sequence of random variables then leads to the filtration $\mathcal{F} = (\mathcal{F}_{t})_{t=0}^{\infty}$ defined as the increasing sequence of generated $\sigma$-fields $\mathcal{F}_{0} \subset \mathcal{F}_{1} \subset \mathcal{F}_{2} \subset \cdots$, where $\mathcal{F}_{0}$ is the trivial $\sigma$-field. A sequence of random variables $(M_t)_{t=0}^{\infty}$ is called a \textit{martingale} if it is adapted to the filtration $\mathcal{F}$, \ie~each $M_t$ is $\mathcal{F}_{t}$-measurable, each $M_t$ is integrable, and it satisfies $\mathbb{E}\left[M_{t} \ \vert  \ \mathcal{F}_{t-1}\right] = M_{t-1}$ for all $t \geq 1$. If this equality is replaced with $\leq$, then we call $(M_t)_{t=0}^{\infty}$ a \textit{supermartingale}. Furthermore, we call a sequence $(\lambda_{t})_{t=1}^{\infty}$ \textit{predictable} if each $\lambda_{t}$ is $\mathcal{F}_{t-1}$-measurable, meaning $\lambda_{t}$ can only depend on the past information up to the time step $t-1$. Finally, we define a random variable $\tau: \Omega \to \mathbb{N} \cup \{\infty\}$ to be a \textit{stopping time} with respect to the filtration $\mathcal{F}$ if, for every $t \geq 0$, the event $\{\tau \leq t\}$ belongs to the $\sigma$-field $\mathcal{F}_t$, \ie~$\{\tau \leq t\} \in \mathcal{F}_t$. This condition ensures that the decision to stop at time $t$ can be made based only on the information available up to time $t$, meaning $\tau$ does not `see into the future'. We also make use of the following martingale concentration inequality in our results:

\begin{lemma}[Azuma-Hoeffding Inequality]
      Let $(\rv_i)_{i=1}^{t}$ be a martingale difference sequence adapted to a filtration $(\mathcal{F}_i)_{i=0}^{t}$, meaning: $
\mathbb{E}[\rv_i | \mathcal{F}_{i-1}] = 0, \   \forall i.$ Suppose there exist constants \( c_i \) such that for all \( i \), $|\rv_i| \leq c_i \  \text{almost surely}.$
Then for any \( \eta > 0 \) we have that \[
\mathbb{P} \left( \left| \sum_{i=1}^{t} \rv_i \right| \geq \eta \right) \leq 2\exp\left(-\frac{\eta^2}{2 \sum_{i=1}^{t} c_i^2} \right).\]
\end{lemma}

There also exist one-sided versions of the above inequality: \[
\mathbb{P} \left( \sum_{i=1}^{t} \rv_i \geq \eta \right) \leq \exp \left( -\frac{\eta^2}{2 \sum_{i=1}^{t} c_i^2} \right) \quad 
\text{ and similarly } \quad
\mathbb{P} \left( \sum_{i=1}^{t} \rv_i \leq -\eta \right) \leq \exp \left( -\frac{\eta^2}{2 \sum_{i=1}^{t} c_i^2} \right).
\]

\subsection{Details on the Sum-Process}
\label{app:sum-process}
As stated in \autoref{sec:method}, the summation wealth process is given as $M_{t}\left(\psi\right) = \sum_{i=1}^{t}\lambda_i \left(\rz_i - \epsilon\right)$, with $(\lambda_t)_{t \in \gT}$ being the predictable betting rate. Under the null $H_{0}\left(\psi\right)$ (\autoref{eq:hypotheses-formulation}) this forms a supermartingale, and hence does not grow in expectation. Conversely, for strictly positive betting rates, $M_t(\psi)$ is a supermartingale sequence only if $\mathbb{E}_{P_t}\left[\rz_t\left(\psi\right) \ \vert \ \mathcal{F}_{t-1}\right] \leq \epsilon, \ \forall t \in \gT$. Thus, we obtain an \emph{if and only if} characterization of the risk control condition: if one can deduce that $M_{t}\left(\psi\right)$ is not a supermartingale, then this provides evidence that the desired risk control assurance is violated. If $M_{t}\left(\psi\right)$ were a martingale sequence, \ie~when $\mathbb{E}_{P_t}\left[\rz_{t}\left(\psi\right) \ \vert \ \mathcal{F}_{t-1}\right] = \epsilon$, we could apply the one-sided Azuma-Hoeffding inequality to argue that the martingale sequence does not grow beyond a certain limit with high probability. More generally, considering a sequence $(\tilde{\rz}_{t})_{t \in \gT}$ such that $\mathbb{E}_{P_t}\left[\tilde{\rz}_{t} \ \vert \ \mathcal{F}_{t-1}\right] = \epsilon$, it can be argued that $\tilde{M}_{t}\left(\psi\right) = \sum_{i=1}^{t}\lambda_i\left(\tilde{\rz}_{i} - \epsilon\right) \geq \sum_{i=1}^{t}\lambda_i\left(\rz_{i} - \epsilon\right)$, hence the one-sided Azuma-Hoeffding bound applied to $\tilde{M}_{t}\left(\psi\right)$ also extends to $M_{t}\left(\psi\right)$. Thus we can argue that
$$\mathbb{P}\left(\sum_{i=1}^{t}\lambda_i \left(\rz_i - \epsilon\right) \geq \eta\right) \leq \exp\left(-\frac{\eta^{2}}{2t}\right),$$
where the boundedness assumptions $\rz \in [0,1]$ and $\epsilon \in [0,1)$ result in a bounded difference sequence (the stated bound takes $c_i = 1$, which holds for betting rates $\lambda_i \leq 1$). Next, we can choose the threshold $\eta = \sqrt{2t \log \frac{1}{\delta}}$ and obtain $\mathbb{P}\left(M_{t}\left(\psi\right) \geq \eta\right) \leq \delta$. If $M_{t}\left(\psi\right)$ does indeed grow above $\eta$, then one can raise the alarm while retaining a false alarm control guarantee, since under the null $M_{t}\left(\psi\right)$ will not grow beyond $\eta$ with probability at least $1-\delta$. We further note that, with high probability, $M_{t}\left(\psi\right)$ remains bounded as $\gO(\sqrt{t})$.
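
As a quick numerical illustration of the threshold choice above, consider the hypothetical values $\delta = 0.05$ and $t = 1000$ (chosen for illustration only, not taken from our experiments). The threshold then evaluates to
\[
\eta = \sqrt{2 \cdot 1000 \cdot \log 20} \approx \sqrt{5991.5} \approx 77.4,
\]
and substituting back into the one-sided bound recovers the target level, since
\[
\exp\left(-\frac{\eta^{2}}{2t}\right) = \exp\left(-\log \frac{1}{\delta}\right) = \delta = 0.05.
\]
Under these values an alarm would thus be raised only once the sum-process exceeds roughly $77.4$ after $1000$ steps, a threshold that grows with $\sqrt{t}$ for a fixed $\delta$.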


\subsection{Risk Control under the \emph{i.i.d.} Data Stream Setting}
\label{app:math-iid-stream}

Consider the simpler setting where the test stream originates \emph{i.i.d.} from a non-shifting test distribution as $(\vx_t, \vy_t)_{t \in \gT} \sim P_0$. The risk quantity to monitor from \autoref{eq:hypotheses-formulation} directly simplifies due to independence as $\mathbb{E}_{P_t}\left[\rz_t \ \vert \ \mathcal{F}_{t-1}\right] = \mathbb{E}_{P_t}\left[\rz_t\right] = \gR_t(\psi)$, and since $P_0 = P_t \; \forall t \in \gT$ is static, we have $\gR_t(\psi) = \gR_0(\psi)$ as a time-independent risk. We can now conveniently reverse our hypotheses under the sequential testing framework to exploit the fact that $P_0$ is static, so that significant discoveries will hold even under future observations. That is, we can test for risk control directly by the hypothesis pair
\begin{align}
\label{eq:hypotheses-iid}
    H_{0}(\psi): \exists t \in \gT: \gR_0(\psi) > \epsilon, \qquad H_{1}(\psi) : \gR_0(\psi) \leq \epsilon \; \forall t \in \gT,
\end{align}
and use the following (reversed) wealth process and $\psi$-CS construction:
\begin{equation}
\label{eq:eprocess-iid}
    M_t(\psi) = \prod_{i=1}^{t}\left(1 + \lambda_{i}\left(\epsilon - \rz_i \right)\right) \quad \text{ and } \quad C_t^{\psi} = \{ \psi \in \gPsi: M_t(\psi) \geq 1/\delta \}. 
\end{equation}
Observe how once sufficient evidence is collected to support that the risk associated with a particular candidate $\psi$ does not exceed the tolerated risk level $\epsilon$, that candidate $\psi$ can be added to $C_t^{\psi}$ safely and indefinitely since the evidence collected remains meaningful under a static $P_0$. We can then directly leverage the Type-I error control property under the sequential testing framework \citep{ramdas2023game} (via Ville's Inequality) to state strong \emph{time-uniform} or \emph{anytime-valid} risk control guarantees. Specifically, it follows that for every $\psi \in \gPsi$ we have
\begin{equation}
\label{eq:risk-control-iid}
        \mathbb{P}_{H_0}(\exists t \in \gT: M_t(\psi) \geq 1/\delta) \leq \delta \;
        \Rightarrow \; \mathbb{P}_{H_0}(\exists t \in \gT: \gR_0(\psi) \leq \epsilon) \leq \delta \;
        \Rightarrow \; \mathbb{P}(\forall t \in \gT: \gR_0(\psi) \leq \epsilon) \geq 1 - \delta.
\end{equation}
In words, the probability of claiming risk control ($H_1$) under risk violation ($H_0$) is upper bounded by $\delta$, whereas under risk control we may perhaps mistakenly claim violation (and thus be overly conservative by excluding the associated $\psi$) but will not invalidate the risk level $\epsilon$. Thus, the overall probability of risk violation is controlled at level $\delta$, rendering a strong safety assurance. Since $\gR_t(\psi)$ is not truly time-dependent, neither is $C_t^{\psi}$, which will initially grow as evidence for each $\psi$ is collected and a decision on inclusion is made, and eventually stabilize. It is then straightforward to also recommend a particular threshold choice if the risk profile is monotonic or in some sense predictable, such as $\hat{\psi}_t := \min C_t^{\psi}$ as the least conservative threshold in case of a monotonically increasing risk. In other words, $\hat{\psi}_t$ will quickly tend to an optimal fixed choice $\hat{\psi}$ after a sufficient number of observations are processed.
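
To make explicit why the reversed wealth process in \autoref{eq:eprocess-iid} behaves as a test supermartingale under the reversed null, the following sketch spells out the one-step argument. It assumes a predictable, non-negative betting rate $\lambda_i$; the admissible range $\lambda_i \leq 1/(1-\epsilon)$ is derived here for illustration and is not a prescription from the main text. Under $H_0(\psi)$ in \autoref{eq:hypotheses-iid} we have $\gR_0(\psi) > \epsilon$, so by independence
\[
\mathbb{E}_{P_0}\left[1 + \lambda_i\left(\epsilon - \rz_i\right) \ \vert \ \mathcal{F}_{i-1}\right] = 1 + \lambda_i\left(\epsilon - \gR_0(\psi)\right) \leq 1.
\]
Moreover, since $\rz_i \in [0,1]$, each factor satisfies
\[
1 + \lambda_i\left(\epsilon - \rz_i\right) \geq 1 - \lambda_i\left(1 - \epsilon\right) \geq 0 \quad \text{whenever} \quad \lambda_i \leq \frac{1}{1-\epsilon},
\]
so the product remains non-negative and Ville's inequality applies to it.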


\subsection{Proofs}
\label{app:math-proofs}
We provide proofs for our main theoretical statements from \autoref{sec:theory} below. We first restate each result for self-containment.

\subsection*{Proof of \hyperref[thm:valid-wealth-process]{Lemma~\ref{thm:valid-wealth-process}}.}

\noindent\textbf{Lemma~\ref{thm:valid-wealth-process}.} (Valid wealth process). 
\emph{
    For any $\psi \in \gPsi$ such that $\mathbb{E}_{P_t} [\rz_t \mid \mathcal{F}_{t-1}] \leq \epsilon \; \forall t \in \mathcal{T}$ satisfies the null, the process $M_t(\psi)$ in \autoref{eq:test-supermartingale} is a valid test supermartingale for the predictable betting rate  $\lambda_t \in [0, 1/\epsilon)$.
}

\begin{proof}
    For any $\psi \in \gPsi$, we consider the test statistic of the form $M_{t}\left(\psi\right) = \prod_{i=1}^{t}\left(1 + \lambda_{i}\left(\rz_i - \epsilon\right)\right)$ where $(\lambda_t)_{t \in \gT}$ is a predictable process. We also have that $\rz_t \in [0,1]$ (\ie~we consider a bounded loss function; from \autoref{sec:background}), and the restriction $\lambda_{t} \in [0, 1/\epsilon)$ renders the term $1 + \lambda_t \left(\rz_t - \epsilon\right)$ non-negative. Furthermore, $M_{t}\left(\psi\right)$ being adapted to the filtration follows from $\lambda_{t}$ being predictable. Integrability of $M_{t}\left(\psi\right)$ follows from the boundedness assumption. Next we verify the supermartingale condition $\mathbb{E}_{P_t}\left[M_t\left(\psi\right) \ \vert \ \mathcal{F}_{t-1} \right] \leq M_{t-1}\left(\psi\right)$. Since conditional on $\mathcal{F}_{t-1}$ randomness only originates from $\rz_t$, we have that $\mathbb{E}_{P_t}\left[M_{t}\left(\psi\right) \ \vert \ \mathcal{F}_{t-1}\right] = M_{t-1}\left(\psi\right) + \lambda_{t}\cdot M_{t-1}\left(\psi\right)\cdot \mathbb{E}_{P_t}\left[\rz_t - \epsilon \ \vert \ \mathcal{F}_{t-1}\right] \leq M_{t-1}\left(\psi\right)$, where the last inequality follows from the fact that for a valid $\psi$ we have $\mathbb{E}_{P_t}\left[\rz_t - \epsilon \ \vert \ \mathcal{F}_{t-1}\right] \leq 0$. Hence, we have shown that $M_{t}\left(\psi\right)$ is a valid test supermartingale (or wealth process) for $\psi$.
\end{proof}
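
The non-negativity step in the proof above can be made explicit as follows. Since $\rz_t \geq 0$ and $\lambda_t \in [0, 1/\epsilon)$, each factor is bounded from below as
\[
1 + \lambda_t\left(\rz_t - \epsilon\right) \geq 1 - \lambda_t \epsilon > 0.
\]
As a small illustration with hypothetical values (not taken from our experiments), take $\epsilon = 0.1$, so that the admissible range is $\lambda_t \in [0, 10)$, and $\lambda_t = 5$. An observed loss of $\rz_t = 0$ then shrinks the wealth by the factor $1 - 5 \cdot 0.1 = 0.5$, whereas $\rz_t = 1$ grows it by the factor $1 + 5 \cdot 0.9 = 5.5$; in both cases the wealth stays strictly positive.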

It is also easy to verify the converse direction for the following corollary:

\noindent\textbf{Corollary.}
\emph{
    $M_t (\psi)$ is a valid test supermartingale if and only if  $\mathbb{E}_{P_t} [\rz_t (\psi) \ | \ \mathcal{F}_{t-1}] \leq \epsilon, \quad \forall t \in \mathcal{T}$.
}

\paragraph{Remark.} So far in our approach, we have not put any restriction on the nature of the stream, \ie~we can have arbitrary dependence between distributions $P_t$ and $P_{t'}, \ t \neq t'$. Thus, the highlighted approach encompasses general and realistic shift settings in the stream. However, in the scenario where the data stream under shift satisfies an independence assumption, \ie~when samples from $P_{t}$ and $P_{t'}$ are independent, the considered hypothesis pair further simplifies to $H_{0}\left(\psi\right): \mathcal{R}_{t}\left(\psi\right) \leq \epsilon , \ \forall t \in \gT, \ H_{1}\left(\psi\right): \exists t \in \gT: \mathcal{R}_{t}\left(\psi\right) > \epsilon$, as then $\mathbb{E}_{P_t}\left[\rz_t \ \vert \ \mathcal{F}_{t-1}\right] = \mathbb{E}_{P_t}\left[\rz_t\right]$. Hence, the hypothesis formulation in \autoref{eq:hypotheses-formulation} is more general.

\subsection*{Proof of \hyperref[thm:false-alarm]
{Lemma~\ref{thm:false-alarm}}.}

\noindent\textbf{Lemma~\ref{thm:false-alarm}.} (False alarm guarantee). 
\emph{
    For any $\psi \in \gPsi$ such that  $\mathbb{E}_{P_t} [\rz_t \mid \mathcal{F}_{t-1}] \leq \epsilon \; \forall t \in \mathcal{T}$ satisfies the null, it holds that $\mathbb{P} \left( \exists t \in \mathcal{T} : M_t (\psi) \geq 1/\delta \right) \leq \delta.$
}
    
\begin{proof}
    The above statement is a direct consequence of Ville's inequality, which we state below for completeness:
    
    \noindent\textbf{Ville's inequality \citep{ville1939etude}.} Given a non-negative supermartingale sequence $(M_{t})_{t \in \gT}$ such that  $M_0 = 1$, it holds that $$\mathbb{P}(\exists t \in \gT \ : \ M_{t} \geq 1/\delta) \leq \frac{\mathbb{E}[M_0]}{1/\delta} = \delta.$$
    Similar to the Azuma-Hoeffding inequality (\autoref{app:sum-process}), Ville's inequality gives probabilistic control on the growth of the supermartingale process. The false alarm guarantee then trivially follows from an interpretation of the obtained Type-I error control, as \hyperref[thm:valid-wealth-process]{Lemma~\ref{thm:valid-wealth-process}} asserts that $M_{t}\left(\psi\right)$ is a valid supermartingale for $\psi \in \gPsi$ such that $\mathbb{E}_{P_t} [\rz_t (\psi) \mid \mathcal{F}_{t-1}] \leq \epsilon, \forall t \in \mathcal{T}$. 
\end{proof}

\subsection*{Proof of \hyperref[thm:power-one-property]
{Lemma~\ref{thm:power-one-property}}.}

\noindent\textbf{Lemma~\ref{thm:power-one-property}.}  (Asymptotic consistency).
\emph{
    For any $\psi \in \gPsi$ such that $\mathbb{E}_{P_t}\left[\rz_t \ \vert \ \mathcal{F}_{t-1}\right] \leq \epsilon$ for finitely many steps $t \in \gT$ and $\mathbb{E}_{P_t}\left[\rz_t \ \vert \ \mathcal{F}_{t-1}\right] > \epsilon$ otherwise, it holds that ${\mathbb{P}(\tau(\psi) < \infty) = 1}$, where $\tau(\psi)$ denotes the stopping time.
}

\begin{proof}
    This is a simple consequence of the property of \emph{sequential test of power one} \citep{darling1968some} as stated in the main text. However, we provide a simple proof below based on the growth rate of $M_{t}\left(\psi\right)$ under the alternative ($H_{1}\left(\psi\right)$). The proof closely follows the ideas outlined in the consistency results from \citet{PandevaFRS24} (Prop. 4.2 in the paper). For notational clarity, we suppress the dependence on $\psi$ below.
    
    We first note that $\mathbb{P}\{\tau = \infty\} = \mathbb{P}\{\cap_{t\geq 1}\{\tau > t\}\} \leq \mathbb{P}\{\tau > t\}$ for every $t$. Taking the limit, $\mathbb{P}\{\tau = \infty\} \leq \limsup_{t \rightarrow \infty}\mathbb{P}\{\tau > t\}$. Next, we will argue that $\limsup_{t \rightarrow \infty}\mathbb{P}\{\tau > t\}$ is zero under the alternative. We have the wealth process $M_t = \prod_{i=1}^{t}\left(1 + \lambda_i \cdot \updelta_i \right)$ with $\updelta_i = \rz_i - \epsilon$. Denoting $v_i = \log (1 + \lambda_i \cdot \updelta_i)$, we define $S_t = \log M_t = \sum_{i=1}^{t}v_i$. Furthermore, let us denote $A_i = \mathbb{E}[v_i \ \vert \ \mathcal{F}_{i-1}]$. With this notation in place, we consider the probability $\mathbb{P}\{\tau > t\}$, \ie~the stopping condition, as follows.
    
    \paragraph{The stopping condition.} The quantity $\mathbb{P}\{\tau > t\}$ is the probability that the stopping time is greater than $t$. Since $\{\tau > t\}$ implies that the wealth has not crossed the threshold at time $t$, it follows from our stopping condition that this probability is at most $\mathbb{P}\{S_t < \log (1/\delta)\}$. Using our notation from above, we have $S_t = \sum_{i=1}^{t}(v_i - A_i) + \sum_{i=1}^{t}A_i$, which gives
    $$\mathbb{P}\{\tau > t\} \leq \mathbb{P}\left\{\frac{1}{t}\sum_{i=1}^{t}(v_i - A_i) + \frac{1}{t}\sum_{i=1}^{t}A_i < \frac{\log(1/\delta)}{t}\right\}.$$
    
    \paragraph{Martingale difference sequence.} It is clear that the sequence $(v_i- A_i)_{i \in \gT}$ is a martingale difference sequence, and following the boundedness assumptions $|v_i - A_i| \leq \lambda_i$. Hence, we can apply the Azuma-Hoeffding inequality to argue that $\mathbb{P}\left\{\left|\frac{1}{t}\sum_{i=1}^{t}(v_i - A_i)\right| > \frac{\eta}{t}\right\} \leq 2\exp\left(\frac{-\eta^{2}}{2 \sum_{i=1}^{t}\lambda_i^{2}}\right)$ for any $\eta > 0$. Given $\lambda_i \leq \lambda_{\text{max}}$ (leveraging bounded betting rates), we have $\sum_{i=1}^{t}\lambda_{i}^{2} \leq t\cdot\lambda_{\text{max}}^{2}$. Choosing $\eta = \sqrt{t \log t}$, and defining the event $G_{t}^{c} = \left\{\left|\frac{1}{t}\sum_{i=1}^{t}(v_i - A_i)\right| > \frac{\eta}{t}\right\}$ to be an undesirable event where the martingale is overly fluctuating, we may state that $\mathbb{P}(G_{t}^{c}) \leq 2 \exp \left\{\frac{- \log t}{2 \lambda_{\text{max}}^{2}}\right\} = 2\, t^{-1/(2 \lambda_{\text{max}}^{2})}$ (a decaying rate in $t$). We define the complementary favourable event, where the martingale difference remains controlled, as $G_t = \left\{\left|\frac{1}{t}\sum_{i=1}^{t}(v_i - A_i)\right| \leq \frac{\eta}{t}\right\}$. Then, we can bound $\mathbb{P}\{\tau > t\}$ using the law of total probability as below:
    \begin{align*}
    \begin{split}
        \mathbb{P}\{\tau > t\} & \leq \mathbb{P}\left(\left\{\frac{1}{t}\sum_{i=1}^{t}A_i < \frac{\log(1/\delta)}{t} + \left|\frac{1}{t}\sum_{i=1}^{t}(v_i - A_i)\right|\right\} \cap G_{t}\right)  + \mathbb{P}\{G_{t}^{c}\} \\
        &\leq \mathbb{P}\left(\left\{\frac{1}{t}\sum_{i=1}^{t}A_i < \frac{\log(1/\delta)}{t} + \sqrt{\frac{\log t}{t}}\right\} \cap G_{t}\right)  + \mathbb{P}\{G_{t}^{c}\} \\
        & \leq \mathbb{P}\left(\left\{\frac{1}{t}\sum_{i=1}^{t}A_i < \frac{\log(1/\delta)}{t} + \sqrt{\frac{\log t}{t}}\right\}\right)  + \mathbb{P}\{G_{t}^{c}\}.
    \end{split}
    \end{align*}
     
     Next, taking the limit $\mathbb{P}\{\tau = \infty\} \leq \limsup_{t \rightarrow \infty}\mathbb{P}\{\tau > t\}$ and noting that $\mathbb{P}\{G_{t}^{c}\}$ vanishes as $t \rightarrow \infty$, only the first term in the above expression remains, \ie~$$\mathbb{P}\{\tau = \infty\} \leq \limsup_{t \rightarrow \infty}\mathbb{P}\{\tau > t\} \leq \limsup_{t \rightarrow \infty}\mathbb{P}\left(\left\{\frac{1}{t}\sum_{i=1}^{t}A_i < \frac{\log(1/\delta)}{t} + \sqrt{\frac{\log t}{t}}\right\}\right).$$
    
    \paragraph{The alternative.} We are given that $\mathbb{E}[\rz_t - \epsilon \ \vert \ \mathcal{F}_{t-1}] > 0$ for infinitely many steps $t$, and $\mathbb{E}[\rz_t - \epsilon \ \vert \ \mathcal{F}_{t-1}] \leq 0$ for only finitely many steps $t$. Denote by $\mu$ the infimum of $\mathbb{E}_{H_1}[\rz_{t} - \epsilon \ \vert \ \mathcal{F}_{t-1}]$ over the steps $t \in \gT$ at which the risk is violated, and assume $\mu > 0$. We assume that the betting rate $\lambda_i$ is small, and using the approximation $\log(1+x) \approx x$ we write $A_i = \mathbb{E}[\log (1 + \lambda_i \cdot \updelta_{i}) \ \vert \ \mathcal{F}_{i-1}] \approx \lambda_{i} \cdot \mathbb{E}[\rz_i - \epsilon \ \vert \ \mathcal{F}_{i-1}]$. We further assume that $\lambda_i$ is bounded away from zero, \ie~$\lambda_i \geq \lambda$ for some $\lambda > 0$. By \emph{Cesàro means}, we have $\liminf_{t \rightarrow \infty} \frac{1}{t}\sum_{i=1}^{t}A_i \geq \liminf_{i \rightarrow \infty} A_i \geq \lambda \mu > 0$, where we use the definition of $\mu$. Now, for sufficiently large $t$ we have 
    $$\frac{1}{t}\sum_{i=1}^{t}A_i \gg \frac{\log 1/\delta}{t} + \sqrt{\frac{\log t}{t}},$$
    and hence $\frac{1}{t}\sum_{i=1}^{t}A_i$ stays bounded away from zero while the other two terms vanish, so that $\limsup_{t \rightarrow \infty} \mathbb{P}\{\tau > t\} = 0$. Therefore, we obtain $\mathbb{P}_{H_1}\{\tau = \infty\} = 0$ and thus $\mathbb{P}_{H_1}\{\tau < \infty\} = 1$.
\end{proof}

\subsection*{Details of \hyperref[thm:gro]
{Definition~\ref{thm:gro}}.} 

We refer to the relevant works such as \cite{waudby2024estimating, koolen2022log, shekhar2023near} on the notion of \emph{growth rate optimality} and related betting rates. \cite{waudby2024estimating} also refer to the condition as \emph{growth rate adaptive to the particular alternative} (GRAPA), whereas \cite{koolen2022log} label it the \emph{GROW} criterion.

\subsection*{Proof of \hyperref[thm:detection-delay-argument]{Proposition~\ref{thm:detection-delay-argument}}.}
\label{app:math-proofs-detection-delay}

\noindent\textbf{Proposition~\ref{thm:detection-delay-argument}.}  (Detection delay bound). 
\emph{
    A worst-case detection delay for the hypothesis pair in \autoref{eq:hypotheses-formulation} and wealth process $M_t(\psi)$ in \autoref{eq:test-supermartingale} is characterized by ${(\tau(\psi) - \tau_*(\psi)) \approx \gO((\log(1/\delta) \, + \, T)/(\lambda \cdot \mu))}$, where $\mu$ denotes the risk violation intensity and $T$ a shift changepoint.
}

\begin{proof}
    We first provide more clarification on the statement of this result. We consider a simplistic setting of risk violations, \ie~for some $\psi \in \gPsi$, $\exists \, T \in \gT$ such that $\mathbb{E}_{P_t}\left[\rz_t \ \vert  \ \mathcal{F}_{t-1}\right] \leq \epsilon$ for $t \leq T$ (\ie~the risk is within control until time step $T$), and $\mathbb{E}_{P_t}\left[\rz_t \ \vert \ \mathcal{F}_{t-1}\right]> \epsilon$ for $t > T$ (\ie~the risk gets violated after time step $T$). Furthermore, we assume $\mathbb{E}_{P_t}\left[\rz_t \ \vert \ \mathcal{F}_{t-1}\right] = \mu + \epsilon$ with $\mu > 0$ for all $t > T$, \ie~we assume that the mean deviates above $\epsilon$ by some fixed positive quantity $\mu$. Our setting now closely resembles a changepoint detection scenario. The goal of our detection delay argument is to study when a risk violation alarm will be raised, and ideally we do not want significant detection delays after reaching time step $T$. Our result will help characterize the flexibility of the methodology we employ to control these delays.
    
    We first consider the simple summation test statistic described in \autoref{sec:method}, and once more suppress dependency on $\psi$ for notational clarity from now on. We then have the wealth process 
    $$M_t = \sum_{i=1}^{t}\lambda_{i}(\rz_i - \epsilon) = \sum_{i=1}^{T}\lambda_i(\rz_i - \epsilon) + \sum_{i = T+1}^{t}\lambda_{i}\left(\rz_i - \epsilon\right).$$ 
    Let us denote $\sum_{i=1}^{T}\lambda_i (\rz_i - \epsilon)$ to be $M_T$, \ie~$M_t = M_T + \sum_{i = T+1}^{t}\lambda_i\left(\rz_i - \epsilon\right)$. For simplicity, we consider a fixed betting rate $\lambda$ moving forward, and we express the pay-off term $(\rz_i - \epsilon)$ as $\left(\rz_i - \mathbb{E}[\rz_i \ \vert \ \mathcal{F}_{i-1}]\right) + \left(\mathbb{E}[\rz_i \ \vert \ \mathcal{F}_{i-1}] - \epsilon\right)$. Thus,
    $$M_t = M_T + \sum_{i=T+1}^{t}\lambda\left(\rz_i - \mathbb{E}[\rz_i \ \vert \ \mathcal{F}_{i-1}]\right) + \sum_{i=T+1}^{t}\lambda\left(\mathbb{E}[\rz_i \ \vert \ \mathcal{F}_{i-1}] - \epsilon\right).$$
    By the assumption in our setting, the last term resolves to $\lambda \cdot (t-T) \cdot \mu,$ and hence the whole expression becomes
    $$M_t = M_T + \lambda \mu (t-T)+ \sum_{i=T+1}^{t}\lambda (\rz_i - \mathbb{E}[\rz_i \ \vert \ \mathcal{F}_{i-1}]).$$
    We employ the same decomposition for $M_T$ and obtain the following expression as
    $$M_t = \underbrace{\sum_{i=1}^{T}\lambda\left(\mathbb{E}[\rz_i \ \vert \ \mathcal{F}_{i-1}] - \epsilon\right)}_{\text{$E_{T}$}} +  \lambda\mu\left(t - T\right) + \underbrace{\sum_{i=1}^{t}\lambda\left(\rz_i - \mathbb{E}[\rz_i \ \vert \ \mathcal{F}_{i-1}]\right)}_{\text{$S_{t}$}}.$$
    
    Now, it can be seen that $\left(\rz_i - \mathbb{E}[\rz_i \ \vert \ \mathcal{F}_{i-1}]\right)_{i \in \gT}$ is a martingale difference sequence, and hence its partial sums $S_t$ remain contained by the Azuma-Hoeffding inequality (\autoref{app:math-defn.-and-terminology}).
    % Hence, the accumulated evidence at the time step $t$ has historical growth $E_{T}$, a term that is linear in $t$, and another term $S_t$ that is sub-linear in $t$. 
    For some pre-specified threshold $b$, we define the stopping time $\tau = \inf\{t \ : \ M_{t} \geq b\}$. To argue for the detection delay, we consider the term $(\tau - T)$ and raise an alarm when 
    \begin{align*}
    \begin{split}
        &E_T + \lambda \mu (\tau - T) + S_\tau = b, \\
        &\Leftrightarrow \lambda\mu (\tau - T) = b - E_T - S_{\tau}, \\
        &\Leftrightarrow \tau - T = \frac{b - E_T - S_\tau}{\lambda \mu}.
    \end{split}
    \end{align*}
    Following this we can analyse the expectation of the detection delay $(\tau - T)$ as $\mathbb{E}[\tau - T] = \frac{b - E_T}{\lambda \mu}$, using $\mathbb{E}[S_{\tau}] = 0$. Since $E_{T} \leq 0$ by the risk control condition for $i \leq T$, the best case is $E_{T} = 0$, attained when $\mathbb{E}[\rz_i \ \vert \ \mathcal{F}_{i-1}] = \epsilon, \ \forall i \leq T$. In the worst case, when $\mathbb{E}[\rz_i \ \vert \ \mathcal{F}_{i-1}] = 0, \ \forall i \leq T$, we obtain $E_{T} = -\lambda T \epsilon$, leading to a worst-case expected detection delay of $(\tau - T) = \frac{b + \lambda T \epsilon}{\lambda \mu} \approx \gO(\frac{b + T}{\lambda \mu})$.
    
    We can also give a high-probability argument using a one-sided Azuma-Hoeffding bound. Since an upper bound on the delay requires a lower bound on $S_{\tau}$, we use the lower-tail version $$\mathbb{P}\left(S_t  \leq -\lambda \cdot \eta\right) \leq \exp\left(\frac{-\eta^{2}}{2t}\right),$$ and taking $\eta = \sqrt{t}/\lambda$ we get $\mathbb{P}\left(S_t \leq -\sqrt{t}\right) \leq \exp\left(-\frac{1}{2\lambda^{2}}\right)$. Considering the worst case setting, we have ${(\tau - T) = \frac{b + \lambda T \epsilon - S_{\tau}}{\lambda \mu} \leq \frac{b + \lambda T \epsilon + \sqrt{\tau}}{\lambda \mu}}$ with high probability, where the $\sqrt{\tau}$ term is sub-linear in $\tau$, which further results in $(\tau - T) \approx \gO\left(\frac{b + \lambda T \epsilon}{\lambda \mu}\right)$.
    
    \looseness=-1We can adopt the exact same arguments for our primary wealth process $M_t = \prod_{i=1}^{t}\left(1 + \lambda_i \left(\rz_i - \epsilon\right)\right)$ (\autoref{eq:test-supermartingale}) by considering $\log M_t$ and using the approximation $\log (1+x) \approx x$, which is valid for small enough $\lambda$. Considering $\log M_t \approx \sum_{i=1}^{t}\lambda \left(\rz_i - \epsilon\right)$, this then matches the sum-process considered in our argument above. In this case, $b$ is replaced with $\log (1/\delta)$, giving $(\tau - T) \approx \gO\left(\frac{\log (1/\delta) + T}{\lambda \mu}\right)$. Some intuitive insights from this expression are that \emph{(i)} the worst-case detection delay grows linearly with $T$, the time step at which risk violation occurs: if the risk violations begin later, then a delay arises from overcoming the decay accumulated during the initial behavior; and \emph{(ii)} the detection delay is inversely proportional to both the betting rate $\lambda$ and the intensity of the violations $\mu$. However, we note that our methodology comes with the flexibility to control these delays to some extent by leveraging a smart betting rate design.
\end{proof}
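
To give a feeling for the magnitudes involved, the following worked example evaluates the worst-case and best-case delay expressions from the proof above with $b = \log(1/\delta)$. All values are hypothetical and chosen for illustration only; the fluctuation term $S_{\tau}$ and the error of the approximation $\log(1+x) \approx x$ are ignored. Take $\delta = 0.05$, $\epsilon = 0.1$, a fixed betting rate $\lambda = 0.5$, a violation intensity $\mu = 0.1$, and a changepoint at $T = 100$. In the worst case, $E_T = -\lambda T \epsilon = -5$, and
\[
\tau - T \approx \frac{\log(1/\delta) + \lambda T \epsilon}{\lambda \mu} = \frac{\log 20 + 5}{0.05} \approx \frac{7.996}{0.05} \approx 160,
\]
whereas in the best case, $E_T = 0$, and
\[
\tau - T \approx \frac{\log(1/\delta)}{\lambda \mu} = \frac{\log 20}{0.05} \approx \frac{2.996}{0.05} \approx 60.
\]
Under these assumed values, the wealth lost before the changepoint accounts for roughly $100$ of the $160$ steps of worst-case delay.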



\section{Additional Related Work}
\label{app:background}

\paragraph{Static risk control.} The framework of \emph{conformal prediction} constructs set predictors with upper bounds specifically on the miscoverage risk under \emph{i.i.d.} or exchangeable data, with a substantial recent body of literature (see, \eg, \cite{angelopoulos2023conformal, fontana2023conformal}). The approach has been extended to more general bounded risks by \cite{angelopoulos2024crc} leveraging similar exchangeability arguments. \cite{bates2021distribution} use concentration inequalities to provide high-probability assurances for monotonic expectation risks, and \cite{angelopoulos2021learn} extend the idea to non-monotonic risks by reframing the task as a non-sequential multiple testing problem. Related in spirit, \cite{angelopoulos2023prediction} propose the use of a hold-out unlabelled dataset to provide probability guarantees for confidence intervals on population-level parameters. 

\paragraph{Risk control for stream data and under shift.} Recent work on conformal prediction includes addressing non-exchangeable data sequences such as time series, \eg~by tracking and updating the tolerated miscoverage rate \citep{Gibbs2021AdaptiveCI, angelopoulos2024online, zaffran2022adaptive} or different weighting schemes \citep{barber2023conformal, guan2023localized}. Particular applications also include outlier detection \citep{Bates2021TestingFO, laxhammar2015inductive}. Recent work on data shifts has included covariate shift \citep{tibshirani2019conformal}, label shift \citep{podkopaev2021distribution} and their abstraction to more general shifts \citep{prinster2024conformal}. All of the above work predominantly focuses on the miscoverage risk, with \cite{Feldman2022AchievingRC} being an interesting extension of \cite{Gibbs2021AdaptiveCI} to more general bounded risks, discussed in \autoref{subsec:connection-methods}. Furthermore, obtainable guarantees are generally asymptotic or finite-sample only under relaxation (\eg, with respect to a permitted coverage deviation from the targeted guarantee). 

% \paragraph{Static risk control and extensions.} \looseness=-1Perhaps most prominently, the framework of \emph{conformal prediction} constructs set predictors with upper bounds specifically on the miscoverage risk under \emph{i.i.d.} or exchangeable data, with a substantial recent body of literature (see, \eg, \cite{angelopoulos2023conformal, fontana2023conformal}). Extensions to more general bounded risks include \cite{angelopoulos2024crc, bates2021distribution, angelopoulos2021learn} leveraging different concentration results, whereas \cite{angelopoulos2023prediction} explore the use of unlabelled calibration data. Recent work on conformal prediction also addresses shifted or non-exchangeable data sequences by tracking and updating the tolerated miscoverage rate \citep{Gibbs2021AdaptiveCI, angelopoulos2024online, zaffran2022adaptive} or different weighting schemes \citep{barber2023conformal, guan2023localized}, and settings include covariate shift \citep{tibshirani2019conformal}, label shift \citep{podkopaev2021distribution} and their abstraction to a more general shift \citep{prinster2024conformal}. \cite{Feldman2022AchievingRC} explore shift settings for more general bounded risks, and we elaborate on this connection in \autoref{subsec:connection-methods}.


\section{Additional Experimental Design}
\label{app:sec-exp-design}

\paragraph{Empirical-Bernstein wealth process.} We directly adopt the formulation as a test supermartingale or wealth process described in \cite{waudby2024estimating} (see Sec. 3.2 and Thm. 2 in their paper), given by
\begin{equation}
\label{eq:app-emp-bernstein}
    M^{EB}_{t}(\psi) = \prod_{i=1}^{t} \exp\{ \lambda_i \, (z_i - \epsilon) - v_i \, \rho(\lambda_i) \}
\end{equation}
with $v_i = 4\,(z_i - \hat{\mu}_{i-1})^2$ and $\rho(\lambda_i) = 1/4\,(- \log(1 - \lambda_i) - \lambda_i)$ for $\lambda_i \in [0,1)$, and using the suggested \emph{predictable plug-in} betting rate $\lambda^{EB}_i = \min\{\sqrt{\frac{2 \, \log(2/\delta)}{\hat{\sigma}^2_{i-1} \, i \, \log(1 + i)}}, c\}$. The estimates $\hat{\mu}_{i-1}$ and $\hat{\sigma}^2_{i-1}$ denote the empirical running mean and variance over the observed loss sequence $\{z_1, \dots, z_{i-1}\}$ up to time $i-1$, thus rendering $\lambda^{EB}_i$ predictable at every time step. We select $c = 1/2$ as a recommended constant $c \in (0,1)$, and omit the bias terms of $1/4$ and $1/2$ for $\hat{\mu}_{i-1}$ and $\hat{\sigma}^2_{i-1}$ respectively, which are negligible for sufficiently large streams. \cite{podkopaev2021tracking} leverage the process in its confidence sequence-equivalent form to estimate bounds on the running risk $R_r(\psi)$ in their problem setting, motivating its inclusion as a baseline.
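
To illustrate how the truncation at $c$ interacts with the plug-in betting rate above, consider the hypothetical values $\delta = 0.05$ and a running variance estimate of $\hat{\sigma}^2_{i-1} = 0.25$ (both chosen for illustration only). Early in the stream, at $i = 5$, the untruncated rate is
\[
\sqrt{\frac{2 \log 40}{0.25 \cdot 5 \cdot \log 6}} \approx \sqrt{\frac{7.378}{2.240}} \approx 1.81,
\]
so the truncation is active and $\lambda^{EB}_5 = c = 0.5$. Later, at $i = 100$, we obtain
\[
\sqrt{\frac{2 \log 40}{0.25 \cdot 100 \cdot \log 101}} \approx \sqrt{\frac{7.378}{115.4}} \approx 0.253,
\]
so that $\lambda^{EB}_{100} \approx 0.253 < c$ and the truncation no longer binds. For a fixed variance estimate the rate thus decays roughly as $1/\sqrt{i \log i}$ once the cap is left.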

\paragraph{Choice of betting rate.} We refer to \cite{waudby2024estimating} for the technical details of various betting rate designs, in particular their App.~B.2 for GRO and App.~B.3 for \emph{approximately GRO}. In essence, a direct optimization of the GRO condition (\hyperref[thm:gro]{Definition~\ref{thm:gro}}) can be achieved by exhaustive root-finding over a fine grid of possible values for $\lambda_t \in [0, 1/\epsilon)$. The approximation to GRO applies an additional Taylor approximation and truncation step to derive a closed-form solution for the root, and is thus substantially more efficient. Our final betting rate expression takes this approach for a particular truncation aligned with our problem setting (\ie~the permitted range of $\lambda_t$). Other approximations to the GRO objective are also possible, such as solving for a lower bound on the wealth (see their App.~B.4 and following).
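
To make the approximation step concrete, the following sketch gives the generic second-order argument, assuming the per-step wealth factor is $1 + \lambda_t\,(\rz_t - \epsilon)$ as in \autoref{eq:app-batching} with $B = 1$. The symbols $\mu_t$, $\sigma_t^2$ and the constant $c$ are introduced here for illustration, and the exact expression used in the main text may differ in its truncation. Using $\log(1 + x) \approx x - x^2/2$ and taking expectations conditional on the past,
\begin{align*}
    \mathbb{E}\left[\log\left(1 + \lambda\,(\rz_t - \epsilon)\right)\right] &\approx \lambda\,\mathbb{E}[\rz_t - \epsilon] - \frac{\lambda^2}{2}\,\mathbb{E}\left[(\rz_t - \epsilon)^2\right], \\
    \lambda^{\star} &= \frac{\mathbb{E}[\rz_t - \epsilon]}{\mathbb{E}\left[(\rz_t - \epsilon)^2\right]} = \frac{\mu_t - \epsilon}{\sigma_t^2 + (\mu_t - \epsilon)^2},
\end{align*}
where $\mu_t$ and $\sigma_t^2$ denote the mean and variance of $\rz_t$, and $\lambda^{\star}$ maximizes the quadratic approximation. Replacing these moments by running estimates computed from $\{z_1, \dots, z_{t-1}\}$ and clipping the result to $[0, c/\epsilon]$ for some $c \in (0,1)$ yields a predictable closed-form rate within the permitted range.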

\paragraph{Batched data stream.} Assume that we receive more than one observation at every time step $t$, \ie~a batch of samples $\{ (\vx_{t,b}, \vy_{t,b})\}_{b=1}^{B} \sim P_t$ for some batch size $B \ll B_*$. We can then average the evidence within each batch to obtain a more robust measure of evidence for risk violations as
\begin{equation}
\label{eq:app-batching}
    M_t(\psi) = \prod_{i=1}^{t} \frac{1}{B} \sum_{b=1}^{B} (1 + \lambda_{i}(z_{i,b} - \epsilon)),
\end{equation}
leading to a reduced variance of the wealth process as well as a reduced detection delay $\tau(\psi) - \tau_*(\psi)$ with respect to the true risk. However, this does not necessarily equate to lower sampling costs, since the total number of observations is $B \cdot t$.
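
To see why batching reduces variance, note that each factor in \autoref{eq:app-batching} depends on the batch only through its mean $\bar{z}_i = \frac{1}{B}\sum_{b=1}^{B} z_{i,b}$ (a shorthand introduced here), since
\begin{equation*}
    \frac{1}{B} \sum_{b=1}^{B} \left(1 + \lambda_{i}\,(z_{i,b} - \epsilon)\right) = 1 + \lambda_{i}\,(\bar{z}_i - \epsilon).
\end{equation*}
Under the additional assumption that the $B$ losses within a batch are i.i.d. given the past with variance $\sigma_i^2$, the batch mean has variance $\sigma_i^2 / B$, so each factor fluctuates less around $1 + \lambda_i\,(\gR_i(\psi) - \epsilon)$ while its conditional mean is unchanged.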

\paragraph{False alarm rate.} For a given experiment run or trial, a threshold candidate $\psi$ raises a false alarm (or is labelled a false positive) if $\gR_t(\psi) \leq \epsilon$ but $M_t(\psi) \geq 1/\delta$, thus erroneously claiming a violation. Equivalently, in terms of the detection delay, the procedure stops prematurely, \ie~$\tau(\psi) - \tau_*(\psi) < 0$. The false alarm rate is then computed as the fraction of false alarms across $R$ trials, \ie
\begin{equation}
\label{eq:false-alarm-rate}
    \%FP = \frac{1}{R}\sum_{r=1}^{R} \mathbbm{1}[(\tau(\psi) - \tau_*(\psi)) < 0],
\end{equation}
and compared to the tolerated false alarm rate $\delta \in (0,1)$. If $\%FP > \delta$, then the Type-I error under $H_0(\psi)$ is not controlled, \ie~the error control guarantee is violated.

\paragraph{Total error rate (TER).} In \autoref{subsec:exp-ood} the target risk to monitor is the \emph{total error rate} (TER), accounting for both cases of inlier (false positives, FP) and outlier misclassification (false negatives, FN). That is, we define the true risk $\gR_t(\psi) = \mathbb{E}_{P_t}[\rz_t]$ with loss variable
\begin{equation}
\label{eq:ood-ter}
    \rz_t = 
    \begin{cases}
        1, & \text{if } \texttt{out}(\rvx_t) \geq \psi \; \text{ and }\; (\rvx_t, \rvy_t) \sim P_{in}, \hfill \text{ (FP)} \\
        1, & \text{if } \texttt{out}(\rvx_t) < \psi \; \text{ and }\; (\rvx_t, \rvy_t) \sim P_{out}, \; \text{ (FN)} \\
        0, & \text{otherwise.}
    \end{cases}
\end{equation}
The TER is a complex risk quantity that is non-monotonic across both time \emph{and} thresholds, since the FP and FN terms introduce competing objectives in terms of what constitutes a `safe' threshold $\psi$. Under a stepwise shift with increasing outlier fraction, the FP term initially carries more weight, motivating a higher threshold choice (since the chance of an inlier mislabelling is reduced). However, as the outlier fraction increases, the FN term becomes more relevant, motivating a lower threshold choice (\ie~increasing the chance of an outlier label). Therefore, no `trivial' safe threshold is available, except for $\hat{\psi} = 1$ if $\pi_{t}^{out} = 0$ and $\hat{\psi} = 0$ if $\pi_{t}^{out} = 1$.
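
The competing roles of the two terms can be made explicit. Assuming, for illustration, that $P_t$ is a mixture of $P_{in}$ and $P_{out}$ with outlier fraction $\pi_t^{out}$, taking expectations in \autoref{eq:ood-ter} gives
\begin{equation*}
    \gR_t(\psi) = (1 - \pi_t^{out}) \, \mathbb{P}_{P_{in}}\left[\texttt{out}(\rvx_t) \geq \psi\right] + \pi_t^{out} \, \mathbb{P}_{P_{out}}\left[\texttt{out}(\rvx_t) < \psi\right].
\end{equation*}
The first term is non-increasing in $\psi$ and the second is non-decreasing, so their weighted sum need not be monotone in $\psi$, and the minimizing threshold moves as $\pi_t^{out}$ changes over time.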

\paragraph{Miscoverage rate (MCR).} In \autoref{subsec:exp-sets} the target risk to monitor is the \emph{miscoverage rate} (MCR), accounting for the exclusion of the correct label $\vy_t$ from the prediction set. That is, we define the true risk $\gR_t(\psi) = \mathbb{E}_{P_t}[\rz_t]$ with loss variable $\rz_t = \mathbbm{1}[\vy_t \notin \hat{f}_{\psi}(\rvx_t)]$. In this case, the MCR is closer to monotonically increasing over time, as the natural temporal shift present in both FMoW and Naval propulsion degrades model performance, and it is clearly monotonic in the thresholds. For FMoW, an indefinitely valid `safe' threshold with zero risk is given by $\hat{\psi} = 0$, resulting in a prediction set matching the full label space $\gY$, and thus $\rz_t = 0$ at every time step. For Naval propulsion, since $\gY \subseteq [0,1]$ and in practice $\vy_t \in [0.95, 1.0]$, we instantiate a fine grid of threshold candidates in the range $\gPsi := [0, 0.05]$. Thus, for a sufficiently well-trained regressor, $\hat{\psi} = 0.05$ trivially ensures coverage (again returning the full response space), while thresholds towards zero place higher reliance on the prediction's accuracy at the risk of miscoverage. Clearly, such `trivial' threshold solutions are generally impractical, but the MCR's monotonic behaviour renders it a more interpretable quantity that is easier to track than the TER in \autoref{subsec:exp-ood}.

\paragraph{Functional Map of the World dataset.} We consider the \emph{Functional Map of the World} dataset (FMoW) \citep{christie2018functional}, a large-scale satellite image dataset with 62 categories of building and land use, collected across various geographic regions and over 16 years (2002 -- 2017). We consider the time-dependent partitioning of FMoW proposed by \cite{yao2022wild}, wherein a natural shift occurs as land use for the \emph{same} satellite image locations, surveyed repeatedly over several years, changes over time. The predictor (a DenseNet-121) is trained on earlier years (2002 -- 2012), and we increase the test stream frequency to simulate daily observations by sampling chronologically from the 2013 -- 2017 data, advancing by one year every 365 time steps. This induces a (slow) stepwise shift, and we observe that the classifier's predictive accuracy worsens as time progresses, in line with results reported by \cite{yao2022wild}.

\paragraph{Naval propulsion system dataset.} We consider predictive maintenance (or equipment monitoring) data on naval gas turbine behaviour \citep{cipollini2018condition}. This tabular time series consists of $\sim 12 \, 000$ recordings of various turbine system parameters, together with an associated turbine compressor degradation coefficient denoting the compressor's health. Over time this degradation coefficient steadily increases from 0.95 to 1.0, denoting a gradual equipment decay. We supplement the data via jittered resampling of early observations (the first 2000 samples) to enrich the initial `healthy' compressor state, and train a Random Forest regressor on that data. As expected, once the compressor gradually degrades beyond its initial `healthy' range the predictor fails to extrapolate, resulting in decreased performance in line with a temporal distribution shift.

\newpage

\section{Additional Experimental Results}
\label{app:sec-exp-res}

\input{fig/fig_ood_noshift}
\input{fig/fig_ood_direct}
\input{tab/ood_step}
\input{tab/ood_noshift}
\input{tab/ood_direct}
\input{tab/fmow_natural}
\input{tab/uci_natural}