\begin{figure*}[t]
\begin{minipage}[!t]{1\columnwidth}
    \centering
    \includegraphics[scale=0.46]{figures/toy_example.png}
    % \includegraphics[width=10cm]
    % \caption{AAA}\label{fig:AAA}
\end{minipage}
% \hfill{}
\begin{minipage}[!t]{1\columnwidth}
    \centering
    \includegraphics[scale=0.46]{figures/toy_example_rocs.png}
    %width=20cm, bb=0 0 1200 900
    % \includegraphics[width=0.4\linewidth]
    % \caption{BBB}\label{fig:BBB}
\end{minipage}
% \begin{subfigure}
% \includegraphics[width=0.4\linewidth]{figures/toy_example.png}
% \end{subfigure}
% \begin{subfigure}
% \includegraphics[width=0.4\textwidth]{figures/toy_example_rocs.png}
% \end{subfigure}
\caption{\textbf{(Left)} Toy example where a classifier learned with \ours{} is preferable to a domain discriminator for detecting a novel category. A domain discriminator is trained to reduce overall loss and hence it is biased towards labelling the upper right cluster with label $1$ (i.e.\ as a novelty). \textbf{(Right)} ROC curves for a domain discriminator and a model trained with \ours{} on the data in the left panel, where examples in $\datatarget$ are labeled positive and examples in $\datasource$ negative. An optimal classifier for the novel category is suboptimal w.r.t.\ aggregate performance metrics (e.g.\ AU-ROC), but has a higher TPR when constrained to a small FPR, illustrating why our constrained learning approach can recover novel categories successfully.}
\label{fig:toy_example}
\end{figure*}
\section{Problem Setting} \label{sec:problem_setting}
% The setting we focus on is \emph{novel class detection under distribution shift}. The formal problem we study appears in the literature under the name biased Positive and Unlabelled learning (PU-learning), we use another name in order to emphasize our focus on novelty detection. Other problems, e.g. estimation of underreporting of medical conditions under selection bias as in \citet{shanmugam2021quantifying}, may share some formalism with ours but the different motivations lead to other sets of assumptions and goals.
In \emph{OOD Novel Category Detection} we seek to detect a novel category (also called novel class, or subgroup) within a dataset that contains both known and novel categories. Crucially, \emph{the distribution of known categories can shift}.

Consider a dataset $\datasource = \left\{\rvx_i\right\}_{i=1}^{n_{\gS}}$ collected under a certain protocol, which we formally treat as an i.i.d.\ sample from some \emph{source distribution} $\Psource$. In our healthcare example, for instance, this is the data collected in the months preceding the pandemic.
%medical records collected over certain years, or images collected at certain times and places. 
At a later time or under different conditions, we collect more data $\datatarget = \left\{\rvx_i\right\}_{i=1}^{n_{\gT}}$, which contains a novel category that we would like to detect. The category is unlabelled, i.e.\ we are not given any examples that are labelled as novelties. We treat examples from this category as samples from a \emph{novelty distribution} $\Plabel{1}$, and call its proportion in the new data, $\alpha\in{[0,1]}$, the \emph{mixture proportion}.
The rest of the data in $S_{\gT}$ is sampled from a nominal distribution $\Plabel{0}$, which we think of as a shifted version of $\Psource$. In summary, $S_{\gT}$ is an i.i.d.\ sample from $\Ptarget = (1-\alpha)\Plabel{0} + \alpha\Plabel{1}$. Our task is as follows.
\begin{definition} [OOD Novel Category Detection] \label{def:prob_setting}
The tuple $\langle \Psource, \Plabel{0}, \Plabel{1}, \alpha, n_{\gS}, n_{\gT} \rangle$ defines an OOD novel category detection problem where $\datasource, \datatarget$ are datasets of $n_{\gS}, n_{\gT}$ examples sampled i.i.d.\ from $\Psource, \Ptarget$ respectively, where $\Ptarget = (1-\alpha)\Plabel{0} + \alpha\Plabel{1}$. For a hypothesis class of binary classifiers $\gH$, let $h^*\in{\gH}$ be the minimizer of the expected $0$-$1$ risk over the target distribution:
\begin{align} \label{eq:risk}
R^{l_{01}}_{\gT}(h) = &(1-\alpha)\cdot\E_{\rvx\sim \Plabel{0}} \left[ h(\rvx)  \right] \nonumber \\
&+ \alpha \cdot \E_{\rvx\sim \Plabel{1}} \left[ 1-h(\rvx)  \right].
\end{align}
An algorithm $\gA:\gX^{n_{\gS}}\times\gX^{n_{\gT}} \rightarrow \gH$ is a learner for the OOD novel category detection problem if for every $\varepsilon, \delta \in (0,1)$ it satisfies $R^{l_{01}}_{\gT}(\gA(S_{\gS}, S_{\gT})) \leq R^{l_{01}}_{\gT}(h^*) + \varepsilon$ with probability at least $1-\delta$ whenever $\min\{n_{\gS}, n_{\gT}\} \geq m_{\gH}(\varepsilon, \delta)$ for a function $m_{\gH}:(0,1)^{2}\rightarrow \sN$.
\end{definition}
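
As a reading aid, the risk in \cref{eq:risk} can be rewritten in terms of error rates measured against the (unobserved) novelty labels. The shorthand $\mathrm{FPR}(h)$ and $\mathrm{TPR}(h)$ below is introduced here only for illustration: write $\mathrm{FPR}(h) = \E_{\rvx\sim \Plabel{0}}\left[ h(\rvx) \right]$ for the false positive rate and $\mathrm{TPR}(h) = \E_{\rvx\sim \Plabel{1}}\left[ h(\rvx) \right]$ for the true positive rate. Then, by linearity of expectation,
\begin{equation*}
R^{l_{01}}_{\gT}(h) = (1-\alpha)\cdot \mathrm{FPR}(h) + \alpha\cdot\left(1 - \mathrm{TPR}(h)\right),
\end{equation*}
so a small risk requires few nominal examples flagged as novelties and most novel examples detected.
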
Further, in this problem $\Psource$ and $\Plabel{0}$ may contain different mixture proportions of the same latent subpopulations \citep{duchi2022distributionally, Sagawa2020Distributionally}. This allows us to tackle challenging scenarios such as the healthcare scenario described earlier, where the types of patients visiting the hospital change over time, including the introduction of new COVID-19 related patient subgroups. Later we will specify the precise distribution shifts that we treat. Denoting the distributions corresponding to subpopulations by $\{G_i\}_{i=1}^{K}$ for some $K\in{\sN}$, and the probability simplex over $[K]$ by $\Delta^{K-1}$,
\begin{align} \label{eq:varying_mixtures}
\Psource = \sum_{i=1}^{K}{\gamma_i G_i}, ~ \Plabel{0} = \sum_{i=1}^{K}{\hat{\gamma}_i G_i}, ~ \rvgamma, \hat{\rvgamma}\in{\Delta^{K-1}}.
\end{align}
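As a purely illustrative instance of \cref{eq:varying_mixtures} (the numbers below are our own assumptions and are not taken from any experiment), take $K=2$, $\rvgamma = (0.8, 0.2)$, $\hat{\rvgamma} = (0.2, 0.8)$ and $\alpha = 0.1$. Substituting into $\Ptarget = (1-\alpha)\Plabel{0} + \alpha\Plabel{1}$ gives
\begin{align*}
\Psource &= 0.8\, G_1 + 0.2\, G_2, \\
\Ptarget &= 0.9\left(0.2\, G_1 + 0.8\, G_2\right) + 0.1\, \Plabel{1} = 0.18\, G_1 + 0.72\, G_2 + 0.1\, \Plabel{1}.
\end{align*}
In this hypothetical case the largest change between source and target is the increased prevalence of $G_2$, even though only $\Plabel{1}$ is new.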
% \attendto{TODO: fix this paragraph and also fix up the PU-learning reference in the next one. Should define the acronym, and maybe also make it not sound like we're just reiterating an old problem.}

%Returning to our example of deploying risk prediction tools in hospitals, the sub-populations correspond to patients with different underlying clinical condition and demographics. The prevalence of these sub populations changes over time, that is: for example, data in the month preceding the onset of the pandemic sampled from $\Psource$, and in the month after onset as sampled from $\Ptarget$, which additionally also has a novel group of size $\alpha$. Somewhat surprisingly, formal treatment in earlier work focuses mainly on problems without distribution shift.

% The common learning setting for this scenario is PU-Learning where $\datatarget$ is sampled i.i.d from the \emph{target distribution} $\Ptarget = (1-\alpha)\Psource + \alpha \Plabel{1}$, and our task is to learn a classifier that detects whether examples belong to the novel group.\footnote{This includes the Mixture Proportion Estimation problem of approximating $\alpha$ which is often treated as a distinct task, e.g. \citet{pmlr-v38-scott15}.} The problem is very well studied (see \citet{bekker2020learning} for a survey) and certainly, solving it can be helpful towards safe deployment of machine learning models. For instance, if we use health records in $\datasource$ to learn models that aid diagnosis, and the novel subgroup is markedly different from past data (e.g. not in the support of the source distribution $\Psource$), we should indicate that our uncertainty regarding this group is high and perhaps refrain from prediction. Furthermore, we may wish to alert practitioners about the occurrence, revise our diagnostic models and further analyze the data. Solving the PU Learning task is critical for such downstream decision making.

\subsection{Motivating Example} \label{sec:toy_example}
To motivate our solution, consider a simple case of latent subpopulation shift, as in \cref{eq:varying_mixtures}, plotted in \cref{fig:toy_example} (Left). There are two latent subpopulations that make up the known categories in $\datasource$, and the novel category is marked with a dashed circle. Let us examine how a method that does not handle distribution shift behaves in this example.

\textbf{Detection without distribution shift.} Formally, our problem can be cast in the framework of learning from Positive and Unlabelled data (PU-learning). Most work under this framework relies on the Selected-Completely-At-Random (SCAR) assumption \citep{elkan2008learning}, that is, $\Psource=\Plabel{0}$. This assumption is very restrictive, and many of the approaches based on it can fail when it is violated. Common algorithms for PU-learning are based on a classifier trained to distinguish the domains $\Psource$ and $\Ptarget$ (a domain discriminator). Intuitively, this approach is effective under SCAR since examples from the novel class turn out to be ``farthest'' from the decision threshold. Adjusting the decision threshold according to a successful Mixture Proportion Estimation (MPE), i.e.\ an estimate of $\alpha$, should then enable us to classify novelties \citep{elkan2008learning, duplessis2014analysis, garg2021mixture}. The example in \cref{fig:toy_example} shows that this approach can run into problems when there is distribution shift between $\Psource$ and $\Plabel{0}$. A domain discriminator trained with logistic regression is biased towards separating the categories that are observed in the source data $S_{\gS}$. This is due to their varying mixture coefficients between $\Psource$ and $\Plabel{0}$, and it results in the examples from the novel category not being farthest from the decision boundary.
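
The failure mode described above can be made concrete with a short population-level sketch. It is an idealization rather than an analysis of the linear model in \cref{fig:toy_example}: we assume all distributions admit densities (denoted by the same symbols) and consider points with $\Psource(\rvx) > 0$. For any fixed class prior, the posterior of an ideal domain discriminator is monotone in the likelihood ratio
\begin{equation*}
\frac{\Ptarget(\rvx)}{\Psource(\rvx)} = (1-\alpha)\frac{\Plabel{0}(\rvx)}{\Psource(\rvx)} + \alpha\frac{\Plabel{1}(\rvx)}{\Psource(\rvx)}.
\end{equation*}
Under SCAR the first term is the constant $1-\alpha$, so ranking examples by the discriminator's score coincides with ranking them by $\Plabel{1}(\rvx)/\Psource(\rvx)$. Under the shift of \cref{eq:varying_mixtures}, the first term equals $(1-\alpha)\sum_{i}\hat{\gamma}_i G_i(\rvx) / \sum_{i}\gamma_i G_i(\rvx)$, which can be large in regions dominated by a subpopulation with $\hat{\gamma}_i \gg \gamma_i$; examples from known categories can then receive scores comparable to, or higher than, those of novel examples.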

\textbf{Why \ours{} solves this.} However, a linear classifier \emph{can} separate the novel category, and the method we propose in this work is able to recover it, as can be seen for the classifier labelled \ours{} in \cref{fig:toy_example} (Left). In a nutshell, \ours{} seeks to maximize the number of points in $\datatarget$ that are detected as novelties, while keeping the number of points in $\datasource$ that are wrongly detected as novelties below a certain threshold. \cref{fig:toy_example} (Right) illustrates why this approach is expected to work by plotting the Receiver Operating Characteristic (ROC) curve for two models, the domain discriminator and the one trained with \ours{}.
The domain discriminator does better (in terms of aggregate classification metrics such as average loss or F1-score) in classifying $\datasource$ vs.\ $\datatarget$; note that larger distribution shifts further improve its discriminative ability. This is clear from the figure, as its ROC curve dominates the other one for most values of the False Positive Rate (FPR). However, an \emph{optimal novel category detector} (in our case this coincides with \ours{}) has a better True Positive Rate (TPR) for small FPR values, as observed in \cref{fig:toy_example} (Right). Intuitively, this model sees a sharp increase in the TPR at low FPR values due to correct classifications of the novel category. Hence our suggested approach should prefer the novel category detector over the domain discriminator. But when is this approach guaranteed to detect the novel category? What are the required assumptions and sample size, and how should we set the bound on the FPR? In the following sections we provide answers to these questions and an implementation of the proposed principle.
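
The principle described above can be sketched informally as a constrained optimization problem. The bound $\beta\in[0,1]$ is a placeholder symbol introduced here only for illustration, and the display is a schematic reading of the verbal description rather than the precise learning rule, whose formal treatment is deferred to the following sections:
\begin{equation*}
\max_{h\in\gH} \; \frac{1}{n_{\gT}}\sum_{\rvx\in \datatarget} h(\rvx) \quad \text{s.t.} \quad \frac{1}{n_{\gS}}\sum_{\rvx\in \datasource} h(\rvx) \leq \beta.
\end{equation*}
The objective is the fraction of target examples flagged as novelties, and the constraint bounds the fraction of source examples that are (necessarily wrongly) flagged, which corresponds to operating at a fixed low FPR on the ROC curve in \cref{fig:toy_example} (Right).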
% Intuitively, an optimal classifier for the novel category (which in this example coincides with \ours), will see a sharp increase in the True Positive Rate (TPR) for low values of the False Positive Rate (FPR) due to correct classifications of the novel category. Yet as we consider higher FPRs, this classifier will be sub-optimal w.r.t a Domain Discriminator that performs well in terms of aggregate classification metrics such as Accuracy, or F1-Score due to the distribution shift. Hence instead of optimizing such aggregate scores, \ours considers optimization of TPR under bounded values of FPR. In \cref{fig:toy_example} (Right) we see that indeed at these low FPR values the curve for \ours lies above that of the Domain Discriminator. But when is this approach guaranteed to detect the novel category? What are the required assumptions, sample size, and how to we set the bound on the FPR? In the following sections we provide answers to these questions and an implementation of the proposed principle.

Concluding this example, we note that other solutions can be devised for the specific dataset we considered, for instance clustering, or training a domain discriminator from a larger hypothesis class. Yet these solutions do not extend gracefully to more general settings.
For instance, it is unlikely that clustering high-dimensional data retrieves the exact subgroups that undergo shift in every dataset of interest. Expressive hypothesis classes are also not a reliable solution, as they introduce biases of their own. For example, it is well known that large overparameterized models tend to perform poorly on small subgroups \citep{pmlr-v80-hashimoto18a, pmlr-v119-sagawa20a, menon2021overparameterisation, wald2022malign}. In our setting, such a subgroup may correspond to the novel category, which comprises a small part of $S_{\gT}$, and thus detection of the novel category may be poor.
% We finish this part with a formal definition of the problem, and then turn to discuss when and how it can be solved.
% \vspace{-10pt}

\begin{comment}
\paragraph{PU-Learning under distribution shift.} 
The problem formulation above relies on a very strong assumption, namely that the only difference between $\Psource$ and $\Ptarget$ is the addition of the novel group, which means that $\Ptarget$ is a mixture of $\Psource$ with $\Plabel{1}$. This is called the Selected-Completely-At-Random assumption (SCAR) in the literature \citep{elkan2008learning}.
Besides being very restrictive, it turns out that many of the approaches based on the SCAR assumption can fail when it breaks. Common algorithms for PU-learning are based on a classifier trained to distinguish $\Psource$ and $\Ptarget$ (PvU classifier), and then adjust the decision threshold according to their Mixture Proportion Estimation (MPE) \citep{elkan2008learning, kiryo2017positive, garg2021mixture}. As shown in Fig. {\color{cyan} TODO: show a plot of a 2D problem that demonstrates failure of a standard Positive vs. Unlabelled classifier in this setting, while our approach succeeds}, this approach can run into problems in the scenario of \Cref{eq:varying_mixtures}. A logistic regression model trained to distinguish between $\datasource$ and $\datatarget$ is biased towards separating the two seen subgroups, and the examples from the novel subgroup are not the farthest ones from the decision boundary. However, a linear classifier can still separate the novel subgroup, and the method we propose in this work will be able to recover it. We finish this part with a formal definition of the problem, and then turn to discuss when and how it can be solved.
\end{comment}

% \footnote{The name Selected-Completely-At-Random is suitable since we may think of the examples in $\datasource$ as if they are labelled positively and those of $\datatarget$ as unlabelled. Then the assumption means that out of all the examples that were sampled from $\Psource$ in the pooled data, $\datasource \cup \datatarget$, the ones that were selected to be labelled positively, i.e. that were drawn from $\Psource$, were selected completely at random.}

% However, it is plausible that data collected at different times or locations will differ in additional aspects, other than the introduction of the new group. For example, we collect health records and the prevalence of certain symptoms, or phenotypes changes between $\Psource$ and $\Ptarget$, regardless of the new group which may correspond perhaps to a newly observed symptom. Then clearly the SCAR assumption breaks, and standard algorithms for the problem will be sub-optimal for detecting the novel examples. As a motivating problem, consider a case where our datasets are drawn from two different mixtures of latent sub-populations {\color{cyan} TODO: add refs to Sagawa, Duchi and other robust opt. works}. Denoting the distributions corresponding to sub-populations by $\{G_i\}_{i=1}^{K}$ for some $K\in{\sN}$, and the probability simplex over $[K]$ by $\Delta^{K-1}$,
% \begin{align} \label{eq:varying_mixtures}
% \Psource = \sum_{i=1}^{K}{\gamma_i G_i}, ~ \Plabel{0} = \sum_{i=1}^{K}{\hat{\gamma_i} G_i}, ~ \rvgamma, \hat{\rvgamma}\in{\Delta^{K-1}}.
% \end{align}

\subsection{Necessary and Sufficient Assumptions for Learning}\label{sec:dist_assum}
Moving towards a principled approach for OOD Novel Category Detection, our first challenge is that 
% Tackling the problem above with standard empirical risk minimization algorithms is not straightforward since 
we do not have access to samples from $\Plabel{0}$ and $\Plabel{1}$, hence $R^{l_{01}}_{\gT}(h)$ cannot be estimated from data. It is easy to show that without any distributional assumptions, guarantees on the performance of a learning algorithm cannot be derived. We state this below and give the proof in \cref{sec:identifiability}.
\begin{restatable}{proposition}{impossibility}
Let $\gA$ be a learning algorithm for the task of OOD novel category detection. There are distributions $\Psource, \Plabel{0}, \Plabel{1}$ such that $\exists h^*\in{\gH}$ for which $R^{l_{01}}_{\gT}(h^*)=0$, while $\E_{S_{\gS}, S_{\gT}}\left[ R^{l_{01}}_{\gT}(\gA(S_{\gS}, S_{\gT})) \right] \geq 0.5$.
\label{prop:impossibility}
\end{restatable}
Since better-than-chance performance cannot be guaranteed without further assumptions, several distributional assumptions under which learning is possible have been formulated in the literature.

\textbf{No distribution shift scenario.} When $\Psource=\Plabel{0}$, assumptions like irreducibility, which says that $\Plabel{1}$ cannot be written as a mixture of $\Psource$ and another distribution, enable identification of $\alpha$ and learning \citep{blanchard2010semi}. Stricter assumptions can help devise more efficient algorithms \citep[e.g.][]{pmlr-v38-scott15, garg2021mixture}, but they are insufficient once we consider distribution shifts.
% \vspace{-12pt}

\textbf{Known subpopulations and invariance of order.} More recent works \citep{garg22adaptation, shanmugam2021quantifying, jain2020class} consider the subpopulation shift scenario of \cref{eq:varying_mixtures} where \emph{the subgroups are known}, or the learner is given a sample from each $G_i$. Once subgroups are known, some variations on the assumptions for the no-distribution-shift scenario enable learning, and methods such as reweighting and resampling can counteract the effects of the shift to obtain learning algorithms. In contrast, in this work we ask what can be done in cases where the subgroups are unknown to the learner. Another type of assumption that has been explored is ``invariance of order'' \citep{kato2018learning, he2018instance}, which can roughly be summarized as $\Psource(\rvx) > \Ptarget(\rvx) \Rightarrow \Plabel{0}(\rvx) > \Plabel{1}(\rvx)$; that is, examples that are more likely under the source distribution are less likely to be novelties. This type of assumption is unsuitable for our goals, as it entails an asymmetry between $\Psource$ and $\Plabel{0}$. For instance, it is reasonable to expect that detection of a novel subgroup of patients is possible regardless of whether it has been introduced in hospital A (corresponding to $\Psource$) or hospital B (resp.\ $\Plabel{0}$). This type of symmetry is denied by the assumption on orderings.
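
To illustrate why known subgroups make reweighting possible, consider the following generic identity. It is a sketch under the assumptions that $\gamma_i > 0$ for all $i$ and that the target proportions $\hat{\rvgamma}$ are known or can be estimated; it is not the specific estimator of any of the works cited above. For a function $f$ of the input, with $g$ denoting the subgroup index of an example drawn from $\Psource$ (i.e.\ $g\sim\rvgamma$ and $\rvx\sim G_g$),
\begin{equation*}
\E_{\rvx\sim \Plabel{0}}\left[ f(\rvx) \right] = \sum_{i=1}^{K}\hat{\gamma}_i\, \E_{\rvx\sim G_i}\left[ f(\rvx) \right] = \sum_{i=1}^{K}\gamma_i\, \frac{\hat{\gamma}_i}{\gamma_i}\, \E_{\rvx\sim G_i}\left[ f(\rvx) \right] = \E_{(\rvx, g)\sim \Psource}\left[ \frac{\hat{\gamma}_g}{\gamma_g}\, f(\rvx) \right].
\end{equation*}
Expectations under the unobserved $\Plabel{0}$ can thus be expressed through source data with per-subgroup weights $\hat{\gamma}_g/\gamma_g$, which is unavailable when the subgroup index $g$ is not observed.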
% \vspace{-12pt}

\textbf{Separability.} The closest assumption to ours is separability, which says that the support of $\Psource$ must be disjoint from that of $\Plabel{1}$ and fully overlap with that of $\Plabel{0}$. Showing that the mixture proportion can be recovered given perfect knowledge of $\Psource$ and $\Ptarget$ is rather straightforward (see \cref{sec:identifiability}), and learning with infinitely large samples is also possible. \citet{bekker2019beyond} propose using the propensity score, $\Psource(\rvx) / \left( \Psource(\rvx) + \Plabel{0}(\rvx) \right)$, to augment and reweight $\datasource$, forming a debiased risk minimization problem.%
\footnote{This expression for the score assumes a uniform prior on being sampled from the source vs. target distribution. Generally, the score is the probability that $\rvx$ was sampled from $\Psource$.}
\citet{gerych2022recovering} show that under separability, the propensity score can be identified from data. They do not provide finite-sample guarantees, and the method requires solving a challenging density ratio approximation problem. Our contributions include a relaxed version of the separability assumption, which leads to finite-sample generalization bounds and a learning rule that is markedly different from approaches based on estimating importance scores.
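
As a brief sketch of why the mixture proportion is identifiable under separability (the argument in \cref{sec:identifiability} may be phrased differently), let $A = \operatorname{supp}(\Psource)$. Separability gives $\Plabel{0}(A) = 1$, since the supports of $\Psource$ and $\Plabel{0}$ coincide, and $\Plabel{1}(A) = 0$, since the supports of $\Psource$ and $\Plabel{1}$ are disjoint. Therefore
\begin{equation*}
\Ptarget(A) = (1-\alpha)\,\Plabel{0}(A) + \alpha\,\Plabel{1}(A) = 1-\alpha, \qquad \text{so} \qquad \alpha = 1 - \Ptarget\left(\operatorname{supp}(\Psource)\right),
\end{equation*}
which depends only on $\Psource$ and $\Ptarget$.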