
\looseness=-1We next describe our notation, problem setting and task in detail, highlighting some key distinctions from existing work.

\paragraph{Notation and risk quantity.} \looseness=-1Let $\gX \times \gY$ denote the sample space with a data-generating distribution $P$ over it, and $\rvx, \rvy$ random variables with realizations $\vx, \vy$.\footnote{Upright lettering denotes random variables and italic lettering their realizations. Boldface denotes multi-dimensional quantities.} We consider access to the outputs of a base predictor $\hat{f}: \gX \rightarrow \gS$, where $\gS \subseteq \mathbb{R}^{|\gY|}$ for classification or $\gS \subseteq \mathbb{R}$ for regression. This model may have been explicitly trained by the user, but may in particular be a pretrained model without internal access, \eg~one accessible only via an API. Next, similar to existing approaches for risk control \citep{angelopoulos2021learn, angelopoulos2024crc, Feldman2022AchievingRC}, we equip the model with a general decision-making mechanism of the form
\begin{equation}
\label{eq:def-threshold-pred}
    \hat{f}_{\psi}(\rvx) = 
    g(\hat{f}(\rvx), \psi),
\end{equation}
where $\psi \in \gPsi,\,\gPsi \subseteq [0,1]$ denotes a particular threshold value and $g$ a generic operator instantiated for each task-specific thresholding mechanism. For example, $g$ can be a binary outlier-flagging decision based on an outlier score computed using $\hat{f}$ (see \autoref{subsec:exp-ood}). Finally, a notion of error for $\hat{f}_{\psi}$ at any particular threshold $\psi$ is captured by a problem-specific \emph{supervised and bounded} loss function $\ell: \gX \times \gY \times \gPsi \rightarrow \gL,\,\gL \subseteq [0,1]$, and the resulting \emph{true population risk} is given by the expected loss
\begin{equation}
\label{eq:def-risk}
    \gR(\psi) = \mathbb{E}_{P}[\ell(\hat{f}_{\psi}(\rvx), \rvy)].
\end{equation}
Because $\ell$ takes values in $\gL \subseteq [0,1]$, it follows that $\gR(\psi) \in [0,1]$. Boundedness of the loss constitutes our key restriction, but we place no conditions on the particular distribution of losses within those bounds \citep{waudby2024estimating}. To simplify notation we additionally define the loss random variable $\rz = \ell(\hat{f}_{\psi}(\rvx), \rvy)$ with realization $z$, and equivalently express the risk in \autoref{eq:def-risk} as $\gR(\psi) = \mathbb{E}_{P}[\rz]$. Crucially, $\gR(\psi)$ is the quantity of interest for which safety assurances of some form are desired in order to robustify decisions made using $\hat{f}_{\psi}$ (and indirectly, $\hat{f}$).
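
\looseness=-1As a purely illustrative instantiation (an assumption made here for exposition, not a definition used elsewhere in this work), consider outlier flagging with a scalar score $s(\vx) \in [0,1]$ computed from $\hat{f}(\vx)$ and a binary label $\vy \in \{0,1\}$, where $\vy = 1$ marks an outlier. One possible choice of $g$ and $\ell$ is
\[
    \hat{f}_{\psi}(\vx) = \mathbf{1}\{s(\vx) > \psi\}, \qquad \ell(\hat{f}_{\psi}(\vx), \vy) = \mathbf{1}\{\hat{f}_{\psi}(\vx) = 0,\; \vy = 1\},
\]
so that the loss is binary, hence bounded in $[0,1]$, and the risk in \autoref{eq:def-risk} reduces to the probability of a missed outlier, $\gR(\psi) = \mathbb{P}(s(\rvx) \leq \psi,\, \rvy = 1)$. Under this hypothetical choice $\gR(\psi)$ is non-decreasing in $\psi$, since raising the threshold can only flag fewer samples.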

\paragraph{Static risk control.} \looseness=-1Assume the deployment of $\hat{f}_{\psi}$ on new \emph{i.i.d.} test data $\gD_{test} \sim P_0$, and access to representative labelled \emph{i.i.d.} calibration data $\gD_{cal} \sim P_0$. Following existing frameworks of risk control such as \emph{RCPS} \citep{bates2021distribution} or \emph{Learn-then-Test} \citep{angelopoulos2021learn}, $\gD_{cal}$ can be leveraged to identify a subset $\hat{\gPsi} \subseteq \gPsi$ of \emph{risk-controlling} thresholds, each of which admits a high-probability upper bound on the population risk. That is, for any $\hat{\psi} \in \hat{\gPsi}$ the guarantee $\mathbb{P}(\gR(\hat{\psi}) \leq \epsilon) \geq 1 - \delta$ holds. The risk level $\epsilon \in (0,1)$ and probability level $\delta \in (0,1)$ are user-specified, and dictate how tightly the risk is to be controlled. For instance, selecting low values for both $\epsilon$ and $\delta$ will enforce strong guarantees but may result in overly conservative decision-making on the basis of a chosen $\hat{\psi}$. Crucially, these approaches operate in a \emph{static} batch setting where the set $\hat{\gPsi}$ is computed once and deployed indefinitely, and are limited by their assumption that the distribution $P_0$ remains \emph{static} over time.
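
\looseness=-1To make the guarantee concrete, the following sketch uses Hoeffding's inequality as one simple, and typically loose, way to certify a single fixed threshold; it is an illustrative assumption and not necessarily the bound employed by the cited frameworks. Let $z_1, \dots, z_n$ denote the losses of a fixed $\psi$ on $n = |\gD_{cal}|$ calibration samples, with empirical risk $\hat{\gR}_n(\psi) = \frac{1}{n}\sum_{i=1}^{n} z_i$. Since the $z_i \in [0,1]$ are \emph{i.i.d.} with mean $\gR(\psi)$,
\[
    \mathbb{P}\Big(\gR(\psi) \leq \hat{\gR}_n(\psi) + \sqrt{\tfrac{\log(1/\delta)}{2n}}\Big) \geq 1 - \delta,
\]
where $\log$ is the natural logarithm. A fixed $\psi$ can therefore be declared risk-controlling whenever $\hat{\gR}_n(\psi) + \sqrt{\log(1/\delta)/(2n)} \leq \epsilon$. For example, with $n = 1000$ and $\delta = 0.05$ the slack term is approximately $0.039$, so a target of $\epsilon = 0.1$ requires an empirical risk of at most roughly $0.061$. Selecting among several candidate thresholds additionally requires structure such as monotonicity of the risk in $\psi$ or a multiple-testing correction, which this single-threshold sketch omits.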

\paragraph{Data stream setting under shift.} \looseness=-1Instead, let us consider a more dynamic stream setting at deployment time. Given a time index set $\gT = \{1, \dots, T\}$, at every time step $t \in \gT$ a covariate $\vx_t$ is obtained, a decision is made using $\hat{f}_{\psi}(\vx_t)$, and subsequently $\vy_t$ is revealed and the loss $z_t$ measured. Thus, the flow of information at each step follows as \emph{covariate $\rightarrow$ decision $\rightarrow$ label $\rightarrow$ loss}, and at time $t$ the observational history $\{(\vx_i, \vy_i, z_i)\}_{i=1}^{t-1}$ is available. If we assume that the test stream is drawn \emph{i.i.d.} as $(\vx_t, \vy_t)_{t \in \gT} \sim P_0$, risk control frameworks as above could be directly applied after observing sufficient samples, and we elaborate further on this simpler setting in \autoref{subsec:connection-methods}. In this work, we address the challenging extension to the stream case under \emph{time-dependent and unknown} distribution shifts. Specifically, we consider a data stream observed as $(\vx_t, \vy_t) \sim P_t$ for $t \in \gT$, where samples at every time step originate from a time-dependent distribution $P_t$ which may shift, and in particular tends to deviate away from any initial $P_0$. Our risk quantity of interest then becomes $\gR_t(\psi) = \mathbb{E}_{P_t}[\rz_t]$, the \emph{time-dependent} true population risk at any given time $t$, and any obtained threshold set $\hat{\gPsi}_t \subseteq \gPsi$ is similarly time-dependent. We suppose minimal knowledge and place \emph{no assumptions} on the nature of the shift, which may be abrupt (a single jump), gradual, \emph{etc.}, and originate in the covariates, labels, or both. As expected, the resulting unpredictability of the future risk makes it substantially harder to provide safety assurances of any kind, yet this setting is commonly encountered in practice. 
Faced with such a challenge at deployment, we examine how to continuously monitor the true risk $\gR_t(\psi)$ for candidate thresholds $\psi$ and identify when violations of the form $\gR_t(\psi) > \epsilon$ occur. 
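
\looseness=-1A short calculation illustrates why an estimate pooled over the observational history does not suffice in this setting; the numbers below are hypothetical and chosen only for exposition. For a fixed $\psi$, linearity of expectation gives
\[
    \mathbb{E}\Big[\frac{1}{t}\sum_{i=1}^{t} \rz_i\Big] = \frac{1}{t}\sum_{i=1}^{t} \gR_i(\psi),
\]
so the running mean of observed losses targets the time-averaged risk rather than the current risk $\gR_t(\psi)$. For instance, if $\gR_i(\psi) = 0.05$ for $i \leq 500$ and $\gR_i(\psi) = 0.20$ for $500 < i \leq 1000$, then at $t = 1000$ the time-averaged risk equals $0.125$ whereas the current risk is $0.20$. With a risk level of $\epsilon = 0.15$, the violation $\gR_t(\psi) > \epsilon$ would thus be masked by the pooled estimate.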