\section{An Assumption on the Rate of Rare Events and an Error Bound} \label{sec:guarantees}
We now turn to develop our algorithm and derive its statistical guarantees. Our first step is to define a
% Let us develop a bound on the error of hypotheses in our class $\gH$.
divergence that measures the extent to which events that are rare under a distribution $P$ are likely under a distribution $Q$. Given a threshold $\beta \geq 0$ that determines which events count as ``rare'', we consider the following divergence.
\footnote{The analogous notion of distance taken w.r.t.\ the measurable subsets $\gB$ under the two distributions, instead of the hypothesis class $\gH$, that is
$d_{1, \beta}\left( P \| Q \right) = \sup_{B\in{\gB}: P(B) \leq \beta}{2 \Big| P(B) - Q(B) \Big|}$, upper bounds $d_{\gH, \beta}$ and is perhaps more intuitive to reason about.}
\begin{definition}
For distributions $P, Q$ over a domain $\gX$, a hypothesis class $\gH$, and $\beta \geq 0$, we define for each $g\in{\gH}$ the set it characterizes, $I(g) = \left\{ \rvx \mid g(\rvx)=1 \right\}$, and denote
\begin{align} %\label{eq:infrequent_divergence}
\restatableeq{\hdiv}{
d_{\gH, \beta}&\left( P \| Q \right) = \\
&\sup_{g\in{\gH}: P\left[I(g)\right] \leq \beta}{2 \Big| P\left[I(g)\right] - Q\left[I(g)\right] \Big|}. \nonumber}
{eq:infrequent_divergence}
\end{align}
\end{definition}
The divergence is similar to the well-known $\gH$-divergence from the domain adaptation literature \citep{bendavid2010adaptation, kifer2004detecting}, but has an additional rate constraint: the hypothesis $g$ may make positive predictions on at most a fraction $\beta$ of the mass of $P$. We use this divergence to state our distributional assumption in what follows. In \cref{sec:disc_diverge} we also give a short discussion of the properties of this divergence.
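
\textbf{A worked example.} To illustrate the role of the rate constraint, consider the following toy instance, which is a hypothetical illustration and not part of the formal development. Let $\gX = \{1, 2, 3\}$, let $\gH$ contain the indicators of all subsets of $\gX$, and take $P = (0.9, 0.08, 0.02)$ and $Q = (0.7, 0.1, 0.2)$. For $\beta = 0.05$, the only feasible sets are $\emptyset$ and $\{3\}$, whereas for $\beta = 0.1$ the sets $\{2\}$ and $\{2, 3\}$ become feasible as well, so that
\begin{align*}
d_{\gH, 0.05}\left( P \| Q \right) &= 2\left| P(\{3\}) - Q(\{3\}) \right| = 2\left| 0.02 - 0.2 \right| = 0.36, \\
d_{\gH, 0.1}\left( P \| Q \right) &= 2\left| P(\{2, 3\}) - Q(\{2, 3\}) \right| = 2\left| 0.1 - 0.3 \right| = 0.4.
\end{align*}
In this instance the divergence grows with $\beta$, as expected, since a larger threshold only enlarges the set of feasible hypotheses.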

\textbf{The Scarcity-of-Unicorns Assumption.} Intuitively, if rare events (or ``unicorns'') under our source distribution $\Psource$ are common under $\Plabel{0}$, it is impossible to tell whether such events are novelties (i.e., were sampled from $\Plabel{1}$) or not. Therefore, a bound on the rate of such rare events is a reasonable assumption on which to base our learning algorithm. In practice, users will have to set a parameter $\beta \geq 0$ that approximates the False Positive Rate (FPR) of an ideal classifier for the new category, which we denote by $\beta(h^*)$. For instance, if we expect to find distinct novel patterns in images, we may set $\beta=0$. A more involved scenario may arise when we observe features such as vitals and lab results of patients, where a novel subpopulation can have some small overlap with previous data. In that case, regulators and domain experts may define appropriate values for this overlap that warrant further examination. The probability of these false positive events under the shifted distribution $\Plabel{0}$ appears as an additional error $\varepsilon_{\text{shift}}$ in our bound (presented in \cref{thm:main_result}), and our main assumption is that this error is bounded.
\begin{assumption}
For a known value $\beta \geq 0$ and $\varepsilon_{\text{shift}} \in {[0,1)}$ it holds that $d_{\gH, \beta}\left( \Psource \| \Plabel{0} \right) \leq \varepsilon_{\text{shift}}$.
\label{assum:unicorn_bound}
\end{assumption}
The error $\varepsilon_{\text{shift}}$ is incurred due to distribution shift, and it can be reduced by setting a smaller value of $\beta$. However, if $\beta$ is too low, we cannot detect instances of the novel category.
Our theoretical result provides guidance on how to scale $\beta$ with the sample size and the complexity of $\gH$. In general, however, we must reason about $\beta(h^*)$ using domain knowledge, and the quality of this approximation is reflected by the term $\beta-\beta(h^*)$ in our error bound.
We discuss potential data-driven methods to reason about $\beta$ in the appendix, and close this part by emphasizing an important special case of \cref{assum:unicorn_bound},
namely the separability assumption, which is common in the PU-learning literature \citep[e.g.,][]{bekker2020learning, gerych2022recovering}.
% \begin{proposition} 
\begin{restatable}{proposition}{seperability}
Assume separability holds, that is, $\Plabel{0}(B) > 0 \Rightarrow \Psource(B) > 0$ for any subset $B$ that is measurable w.r.t.\ both distributions.\footnote{Separability also assumes $\exists h^*\in{\gH}$ such that $R^{l_{01}}_{\gT}(h^*)=0$, but we do not require this to prove \cref{prop:seperability}.} Then Scarcity-of-Unicorns (\Cref{assum:unicorn_bound}) holds with $\beta$ and $\varepsilon_{\text{shift}}$ both set to $0$.
\label{prop:seperability}
\end{restatable}
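To see why, here is a brief sketch of the argument, under the assumptions that every set $I(g)$ is measurable and that the supremum over an empty set is taken to be $0$. With $\beta = 0$, the supremum in \cref{eq:infrequent_divergence} ranges only over $g\in{\gH}$ with $\Psource\left[I(g)\right] = 0$. By the contrapositive of separability, $\Psource\left[I(g)\right] = 0$ implies $\Plabel{0}\left[I(g)\right] = 0$, and therefore
\begin{equation*}
d_{\gH, 0}\left( \Psource \| \Plabel{0} \right) = \sup_{g\in{\gH}: \Psource\left[I(g)\right] = 0}{2 \Big| \Psource\left[I(g)\right] - \Plabel{0}\left[I(g)\right] \Big|} = 0.
\end{equation*}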
\subsection{A Constrained Learning Rule and Its Generalization Properties} \label{sec:learning_rule}
We are now in a position to present our learning rule and its statistical guarantee. The following theorem, which we prove in \cref{app:proofs_guarantees}, summarizes our proposal and result. We use the Rademacher complexity \citep{bartlett2002rademacher}, denoted by $R_{n,P}(\gH)$ for a distribution $P$ and sample size $n$, as a measure of the expressiveness of $\gH$, though other standard notions can be used.
\begin{theorem} \label{thm:main_result}
Let $\langle \Psource, \Plabel{0}, \Plabel{1}, \alpha, n_{\gS}, n_{\gT} \rangle$ define an OOD novel category detection problem (see \cref{def:prob_setting}) and let $h^*\in{\gH}$ be a minimizer of $R^{l_{01}}_{\gT}$. The following statements hold:
\begin{itemize}[leftmargin=*]
\item Let $\beta(h)=\E_{\rvx\sim\Psource}{\left[h(\rvx)\right]}, \alpha(h)=\E_{\rvx\sim\Ptarget}{\left[h(\rvx)\right]}$ be the False Positive Rate (FPR) and recall of a hypothesis $h\in{\gH}$ w.r.t the task of classifying source and target data. The target risk on detecting the novel category can be bounded by \begin{align} \label{eq:err_upper_bound}
R^{l_{01}}_\gT(h) \leq [\alpha &- \alpha(h)] + \\
&(1-\alpha)\left[\beta(h) + d_{\gH, \beta(h)}\left( P_{\gS} \| \Plabel{0} \right)\right]. \nonumber
\end{align}
\item Let $\delta>0$ and assume our problem satisfies \cref{assum:unicorn_bound} with parameters $\beta \geq \beta(h^*) + \frac{R_{n_{\gS}, \Psource}(\gH)}{2} + \sqrt{\frac{\ln(1/\delta)}{2n_{\gS}}}$ and $\varepsilon_{\text{shift}}\geq 0$. Consider $\hat{h}=\gA(S_{\gS}, S_{\gT})$ that solves the empirical learning rule,
\begin{align}
\restatableeq{\precatrec}{
&\max_{h\in{\gH}}{\hat{\alpha}(h)} \\
&\text{s.t. } \hat{\beta}(h) \leq \beta \nonumber}{eq:precision_at_recall},
\end{align}
where $\hat{\alpha}(h), \hat{\beta}(h)$ are empirical estimates of $\alpha(h), \beta(h)$ from $S_{\gT}, S_{\gS}$ respectively. We have with probability at least $1-4\delta$ that
\begin{align} %\label{eq:empirical_err_bnd}
    \restatableeq{\emperrbnd}{
    R^{l_{01}}_{\gT}(\hat{h}) &\leq R^{l_{01}}_{\gT}(h^*) + 4\varepsilon_{\text{shift}} + 2(\beta-\beta(h^*)) \nonumber \\
    & + R_{n_{\gS}, \Psource}(\gH) + R_{n_{\gT}, \Ptarget}(\gH) \nonumber \\
    & + \sqrt{2\ln(1/\delta)}\left[ n_{\gS}^{-\frac{1}{2}} + n_{\gT}^{-\frac{1}{2}} \right]}{eq:empirical_err_bnd}.
\end{align}
\end{itemize}
\end{theorem}
Let us break down the statement and draw conclusions. The proposed learning rule in \cref{eq:precision_at_recall} optimizes an upper bound on the error, which is given in the first part of the theorem (\cref{eq:err_upper_bound}). Unfortunately, this upper bound cannot be estimated from data, since a sample from $\Plabel{0}$ is required to estimate $d_{\gH, \beta(h)}\left( \Psource \| \Plabel{0} \right)$. This is where \cref{assum:unicorn_bound} comes in: it lets us replace the divergence term with $\varepsilon_{\text{shift}}$, provided that $\beta(h)$ is at most $\beta$. Finally, \cref{eq:empirical_err_bnd} gives a generalization bound on the error of the learned classifier.
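
In more detail, the substitution can be sketched as follows; this is an informal unpacking of the argument, and the formal proof is the one in \cref{app:proofs_guarantees}. The feasible set in \cref{eq:infrequent_divergence} grows with the threshold, so $d_{\gH, \beta'} \leq d_{\gH, \beta}$ whenever $\beta' \leq \beta$. Hence, for any $h\in{\gH}$ with $\beta(h) \leq \beta$, combining \cref{eq:err_upper_bound} with \cref{assum:unicorn_bound} gives
\begin{align*}
R^{l_{01}}_\gT(h) &\leq [\alpha - \alpha(h)] + (1-\alpha)\left[\beta(h) + d_{\gH, \beta}\left( \Psource \| \Plabel{0} \right)\right] \\
&\leq [\alpha - \alpha(h)] + (1-\alpha)\left[\beta + \varepsilon_{\text{shift}}\right].
\end{align*}
The only term on the right-hand side that depends on $h$ is $\alpha(h)$, so minimizing this bound over $\{h\in{\gH} : \beta(h) \leq \beta\}$ amounts to maximizing $\alpha(h)$ under the constraint, which is the population counterpart of \cref{eq:precision_at_recall}.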

\textbf{Takeaways from \cref{thm:main_result}.} Focusing on separable problems (see \cref{prop:seperability}), we may discard the terms that depend on $\varepsilon_{\text{shift}}$ and $\beta$ from the generalization bound in \cref{eq:empirical_err_bnd}.\footnote{That is, if we set $\beta$ according to the separability assumption, approaching $0$ with growing $n_{\gS}$. Otherwise, the error $\beta-\beta(h^*)$ does not approach $0$, reflecting how well we approximate $\beta(h^*)$.} We then gather that the algorithm $\gA$ that solves \cref{eq:precision_at_recall} is a learning algorithm for the problem, as prescribed in \cref{def:prob_setting}, so long as $\gH$ is learnable in the standard sense of learning theory \citep{shalev2014understanding}. Note that previously proposed approaches solve more general problems, such as clustering \citep{jain2020class} or density ratio approximation \citep{gerych2022recovering}, and hence do not provide this type of learnability guarantee. Following the principle that one should not solve a more general problem than required \citep{vapnik2006estimation}, we opt for direct optimization of an upper bound on the error, sidestepping such intermediate steps. We also conclude that, when using our proposed learning rule, the value of $\beta$ in the constraint of \cref{eq:precision_at_recall} should be set above $0$ even when separability holds (i.e., $\beta(h^*)=0$): it should scale with the complexity of $\gH$ and inversely with $n_{\gS}$. When separability does not hold, we incur an additional irreducible error proportional to $\varepsilon_{\text{shift}}$.
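
\textbf{A numerical illustration.} To get a feel for the magnitudes involved, consider the following hypothetical values, chosen only for illustration: $n_{\gS} = n_{\gT} = 10^4$, $\delta = 0.05$, $R_{n_{\gS}, \Psource}(\gH) = R_{n_{\gT}, \Ptarget}(\gH) = 0.02$, and $\beta(h^*) = 0$, and assume that \cref{assum:unicorn_bound} holds with $\varepsilon_{\text{shift}} = 0$ at the chosen $\beta$. The condition of \cref{thm:main_result} on $\beta$ then reads
\begin{align*}
\beta \geq 0 + \frac{0.02}{2} + \sqrt{\frac{\ln(20)}{2\cdot 10^{4}}} \approx 0.010 + 0.012 = 0.022,
\end{align*}
and, setting $\beta$ to this smallest admissible value, the excess risk in \cref{eq:empirical_err_bnd} is at most
\begin{align*}
\underbrace{2 \cdot 0.022}_{2(\beta-\beta(h^*))} + \underbrace{0.02 + 0.02}_{\text{complexity terms}} + \underbrace{\sqrt{2\ln(20)}\left[ 0.01 + 0.01 \right]}_{\approx\, 0.049} \approx 0.133,
\end{align*}
with probability at least $1 - 4\delta = 0.8$. Under these assumed values, the three groups of terms are of comparable size, which is consistent with scaling $\beta$ together with the complexity of $\gH$ and with $n_{\gS}^{-1/2}$.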

\begin{comment}
However, we can often reason about this quantity and in a sense the following assumption, which we use for the rest of our development, is a working definition of a novelty w.r.t $\Psource$.

\begin{restatable}{lemma}{upperbound}
For a novelty detection problem as in \Cref{def:prob_setting}, let $h\in{\gH}$ and denote $\alpha(h) = \E_{\Ptarget}{[h(\rvx)]}$, while $\beta(h) = \E_{\Psource}{[h(\rvx)]}$. Define,
\begin{align*}
\bar{R}^{l_{01}}_\gT(h) = [\alpha &- \alpha(h)] + \\
&(1-\alpha)\left[\beta(h) + d_{\gH, \beta(h)}\left( P_{\gS} \| \Plabel{0} \right)\right].
\end{align*}
Then we have that $R^{l_{01}}_\gT(h) \leq \bar{R}^{l_{01}}_\gT(h)$.
\label{lem:err_bound}
\end{restatable}

The proof of this statement is in \cref{app:proofs}. Intuitively, if we consider all examples from $\Psource$ as labelled with $y=0$, the term $d_{\gH, \beta(h)}\left( P_{\gS} \| \Plabel{0} \right)$ bounds the extent to which false positives from $\Psource$ can become ubiquitous under $\Plabel{0}$.


{\color{cyan} consider removing the following paragraph} Another interesting way to reason about the divergence is by using additional available data. For instance, if we have data from an additional source $S_{\text{aux}}$, where we know that a typical distribution shift occurred and no novelties have been introduced, then $d_{\gH, \beta(h)}\left( P_{\gS} \| P_{\text{aux}} \right)$ can be estimated efficiently from data by solving a rate constrained learning problem and used as an estimator for $d_{\gH, \beta(h)}\left( P_{\gS} \| \Plabel{0} \right)$. The algorithm we will use to minimize the upper bound of \Cref{lem:err_bound} is also based on rate-constrained learning, hence we leave the presentation of this algorithm to the next section and conclude our discussion on properties of $d_{\gH, \beta(h)}\left( P_{\gS} \| \Plabel{0} \right)$ by stating its equivalence to a rate-constrained classification problem {\color{cyan} TODO: write the formal result about this}. {\color{cyan} end of part to consider for removal}
\paragraph{Minimizing the Error Upper Bound.} The attractive property of \Cref{lem:err_bound} and \Cref{assum:unicorn_bound} is that the rest of the expressions in the bound that depend on $h$, namely $\alpha(h)$ and $\beta(h)$, can be estimated from data. Hence this directly suggests the following learning rule,
% \begin{align}
% \restatableeq{\precatrec}{
% &\max_{h\in{\gH}}{\alpha(h)} \\
% &\text{s.t. } \beta(h) \leq \beta. \nonumber}{eq:precision_at_recall}
% \end{align}
In case our bound on the false positive rate $\beta$ is larger than the false positive rate of the optimal hypothesis, we may bound the sub-optimality of the hypothesis learned by the above rule using $\beta, \varepsilon_{\text{shift}}$.
\begin{restatable}{lemma}{precrecallbound}
Let $h^*\in{\gH}$ a minimizer of $R^{l_{01}}_{\gT}(h)$ and assume that $\beta \geq \beta(h^*)$. For $\hat{h}$ that is optimal for \Cref{eq:precision_at_recall}
we have that $R^{l_{01}}_{\gT}(\hat{h}) \leq R^{l_{01}}_{\gT}(h^*) + (1-\alpha)\left[\beta + \varepsilon_{\text{shift}}\right]$.
\label{lem:upper_bound_ours}
\end{restatable}
Two points are worth mentioning with regards to this bound. When $\beta < \beta(h^*)$ it is also possible to draw a bound on the suboptimality of the solution to \cref{eq:precision_at_recall}, though  $\alpha(h)$. Another point is that in the domain adaptation literature, bounds on the difference of risks between a hypothesis $h$ and another hypothesis $h^*$ are usually expressed w.r.t to the $d_{\gH\Delta\gH}$ divergence instead of $d_{\gH}$ that we extend here.

To summarize, solving \Cref{eq:precision_at_recall} with some small pre-specified value of $\beta$ can entail guarantees on the near optimality of the solution. These guarantees depend on the validity of our assumptions laid out above, and as explained in \Cref{sec:dist_assum}, distributional assumptions are necessary to provide guarantees on novel group discovery, moreover under distribution shift. Our assumptions are a relaxed version of separability, and they dictate a constrained learning rule as a solution to our problem. To apply this learning rule, we now need to find computationally efficient techniques for estimating it from finite samples.
\end{comment}
% Presumably, in case $\beta < \beta(h^*)$ then this could limit the amount of novelties we recover


%%%% old weird counterexample
% Consider a toy problem with distributions over $3$ states, where $\Psource = [1-\epsilon, \epsilon, 0]$ and $\alpha=0.5$. Let us examine two cases for the ground-truth data generating process, one given by $\Plabel{0} = [0, 1-\epsilon, \epsilon], \Plabel{1} = [0, \epsilon, 1-\epsilon]$ for some small $\epsilon > 0$, and the other where we switch the labels $\tilde{P}_{\gT, 0} = [0, \epsilon, 1-\epsilon], \tilde{P}_{\gT, 1} = [0, 1-\epsilon, \epsilon]$. This means $\Ptarget = \alpha\Plabel{0} + (1-\alpha)\Plabel{1} = \alpha\tilde{P}_{\gT, 0} + (1-\alpha)\tilde{P}_{\gT, 1}$. Hence a learning algorithm that receives $\Psource$ and $\Ptarget$, observes the same inputs regardless of whether the ground truth is dictated by $\tilde{P}_{\gT, y}$ or $\Plabel{y}$. 
% On the other hand
% and not just a finite sample, it is impossible to learn a classifier that generalizes in both scenarios. , hence a PU-learning algorithm that receives $\Psource, \Ptarget$ gets the same input in both scenarios. On the other hand, it is easy to see that any hypothesis $h$ that achieves small error when the true distributions are $\Plabel{0}, \Plabel{1}$ (e.g. that returns $1$ for the third state and $0$ for the others, achieving $R_{\gT}(h) = \epsilon$), thus solving the problem with small error. On the other hand, if we switch the label in the target distribution and let $\Plabel{0} = [0, \epsilon, 1-\epsilon], \Plabel{1} = [0, 1-\epsilon, \epsilon]$, then the same hypothesis achieves error $1-\epsilon$. It easy to see that for any hypothesis that achieves low error on the first problem, will obtain high error for the second. Show that switching the roles of the $Y=0$ and $Y=1$ leads to the same input to the learner, and hence the problem is unidentifiable
% Established assumptions in the literature on our problem are separability
%%%% end old weird counterexample

% Consider a dataset $D_A = \left\{\rvx_i\right\}_{i=1}^{N_A}$ collected under a certain protocol, we formally treat this as an i.i.d sample from some \emph{nominal distribution} $P_A$. For instance medical records collected over certain years, or images collected at certain times and places. At a later time, or under different conditions, we collect more data $D_B = \left\{\rvx_i\right\}_{i=1}^{N_B}$ which contains a novel class/subgroup of size $\alpha\in{[0,1]}$ in the population, and it is unlabelled (i.e. we are not given any examples that are labelled as novelties). We treat this subgroup as a sample from a \emph{novelty distribution} $P_1$, and denote its size in the new data by the \emph{mixture proportion} $\alpha$. The common learning setting for this scenario is PU-Learning where $D_B$ is sampled i.i.d from the distribution $P_B = (1-\alpha)P_A + \alpha P_1$, and our task is to estimate $\alpha$ while learning a classifier that detects whether examples belong to the novel class. The problem is very well studied (see e.g. \citet{bekker2020learning} for a survey) and clearly it is important for safe deployment of machine learning models. For instance, if we use health records in $D_A$ to learn models that aid diagnosis, and the novel subgroup is markedly different from past data (e.g. not in the support of the nominal distribution $P_A$), we should indicate that our uncertainty regarding this group is high and perhaps refrain from prediction. Furthermore, we may wish to alert practitioners of the occurrence, revise our diagnostic models and further analyze the data. Solving the PU Learning task is critical for such downstream decision making.

% \vspace{80pt}
% \paragraph{The SCAR assumption.} In the problem description above we made a very strong assumption about our data, namely that the only difference between $P_A$ and $P_B$ is the addition of the novel class (i.e. this simply means that $P_B$ is a mixture of $P_A$ with $P_1$). This is called the Selected-Completely-At-Random assumption (SCAR) in the literature.\footnote{The name Selected-Completely-At-Random is suitable since we may think of the examples in $D_A$ as if they are labelled positively and those of $D_B$ as unlabelled. Then the assumption means that out of all the examples that were sampled from $P_A$ in the pooled data, $D_A\cup D_B$, the ones that were selected to be labelled positively (i.e. that belong in $P_A$) were selected completely at random.} However, it is plausible that there will be additional differences between data collected at different times or locations, other than the introduction of the new class. For example, we collect health records and the prevalence of certain symptoms, or phenotypes changes between $P_A$ and $P_B$ (regardless of the new class, which may correspond perhaps to a new symptom). Then clearly the SCAR assumption breaks, and many algorithms for the problem will fail in detecting the new class.
% Indeed, several works in recent years move beyond this assumption \citep{shanmugam2021quantifying, garg22adaptation, bekker2019beyond, gerych2022recovering, he2018instance}. We will review some of them in \cref{sec:related_works} below, see how they differ from the assumptions will explore here, and lead to different algorithms.

% \paragraph{Assumptions we will make in our development.} Before delving into the related literature, we should note that to give any meaningful result about the identifiability of $\alpha$ and of the novel class, some assumptions must be made. The main assumption we must make has to relate $P_B$ and $P_A$ in some manner. The only thing we can tell about $P_B$ in the absence of the SCAR assumption is that it is a mixture of the new class and \emph{some} distribution $P_0$ (i.e. $P_B = (1-\alpha) P_0 + \alpha P_1$, and $P_0\neq P_A$). Then perhaps the most generic way to relate these distributions is by common membership in a known uncertainty set of distributions $\gP$ (i.e. $P_0, P_A \in{\gP}$). Thus depending on our definition of $\gP$ we can recover several types of reasonable assumptions. In this paper we will mainly be interested in the case where $P_A \in \mathrm{relint}{\gP}$ (i.e. $P_A$ has non-zero mass wherever some $P\in{\gP}$ has non-zero mass) and the support of $P_1$ does not overlap with the support of any distribution in $\gP$. This will let us recover the novel group under very weak assumptions on $\mathrm{relint}{\gP}$ and hence gives a widely applicable algorithm. Even though this is a rather strong assumption, we will show that the methods we develop give meaningful results in different cases where the assumption does not hold. In \Cref{sec:beyond_no_overlap} we will discuss how the results of our proposed method should be interpreted in practice, and how future work can move beyond the settings we discuss here.


% \section{Related Work} \label{sec:related_work}
% {\color{cyan} Consider moving to the end, together with discussion}
% The problem is closely related to detection of anomalous examples, or Out-of-distribution detection \citep{ruff2021unifying}. In these problems a detector needs to be trained without access to the novel data, hence the goal is different as OOD detection does not necessarily consider is somewhat easier as it considers the case where the learner observes data which contains novelties.

% Another related setting is Open-set classification or Open-World learning \ldots little formal guarantees

% Uncertainty estimation and providing guarantees on loss under distributions shift. Cite evaluation paper and conformal prediction papers, calibration, IRM, nurd etc.

% PU-learning under the SCAR aassumption, say closest works on PU-learning under distribution shift will be covered later with more context \ldots.
% Indeed, several works in recent years move beyond this assumption \citep{shanmugam2021quantifying, garg22adaptation, bekker2019beyond, gerych2022recovering, he2018instance, kato2018learning}. Most methods are based on the assumption that $\Psource(\rvx) / \Plabel{0}(\rvx)$ can be obtained for each example in our dataset. Intuitively, once this ratio is known we can uses reweighting or resampling to transform the biased problem into a standard PU-learning problem that adheres to the SCAR assumption. Methods based on this principle that are applicable to our setting \citep{kato2018learning, gerych2022recovering} involve a density ratio estimation as a preliminary step for using standard PU-learning methods. At the population level where we are given infinite data, these methods are guaranteed to ``debias" the problem, under some distributional assumptions that we shall discuss in \Cref{sec:dist_assum}. However, density ratio estimation can be prohibitive in terms of computation time and accuracy when we are concerned with high-dimensional data. Furthermore, it is difficult to derive standard generalization bounds (that do not depend on the dimension of the features) for the downstream detection problem.

% To sidestep the challenging density ratio estimation, our method relies on recent advances in constrained learning where we wish to minimize expected losses under data dependent constraints involving expected values \citep{cotter2019optimization, cotter2019training, chamon2022constrained}. The potential of constrained learning in the PU setting has been noted in early works on the topic \cite{liu2002partially}. Yet lacking the tools to tackle this problem directly, they usually turn to alternative methods \cite{liu2002partially}, or ones that are applicable to specific hypothesis classes, like SVMs ** cite Joachims **. Constrained learning methods can be applied to flexible hypothesis classes and come with generalization bounds, making them an attractive solution to the problem.