\section{Introduction}
Distribution shifts occur in most real-world scenarios where machine learning (ML) is deployed. These shifts can result from both natural and adversarial changes, including differences in data recording protocols, shifts in the underlying population being monitored, or changes in the way the ML tool is used \citep{pmlr-v139-koh21a, finlayson2021clinician, quinonero2008dataset, saria2019tutorial}. While some shifts do not pose an immediate safety concern, others warrant examination and proper treatment. In this paper, we are concerned with potential risks that arise from the emergence of a novel category or subgroup, and we study guarantees for the automated detection of such subgroups under practical, real-world assumptions.

As motivation, consider dataset shift scenarios in the healthcare domain \citep[Table~1]{finlayson2021clinician}. At the start of the COVID-19 pandemic, a Michigan hospital described how a predictive tool for identifying patients at risk for sepsis, a life-threatening complication, started to over-alert and incorrectly flag patients as the underlying population shifted \citep{finlayson2021clinician}. Ultimately, the hospital had to turn off the tool because of the harms it posed to patients. In this scenario, the tool was scanning and providing predictions on new patient groups (e.g., patients with likely COVID-19), which led to the safety issue.\footnote{A trivial solution might be to filter out likely new subgroups, including COVID-19 patients, from the list of patients that the tool was allowed to make predictions on. However, because patient diagnoses were not available upon presentation to the hospital, this was inadequate.}

In this paper, we tackle the problem of \emph{Out-of-Distribution (OOD) Novel Category Detection} (also called novel class or novel subgroup detection). We aim to identify novel instances within a dataset that contains both known and novel categories. What sets our approach apart is that we not only account for the introduction of a new category but also allow \emph{other distribution shifts between the new and previously observed data}. This is essential for monitoring risks in real-world applications, where distributions change over time and new data continually emerges.

Returning to the healthcare scenario described above, besides the introduction of new patient subgroups related to COVID-19, the baseline population itself shifted because the types of patients coming into the hospital evolved over the course of the pandemic. Early on, only those with urgent needs visited. Over time, those with longer-term needs and planned surgeries began to use the hospital. Requiring the baseline distribution to remain constant over time is a highly restrictive assumption, and in many real-world settings we need the ability to detect novel categories without enforcing it, that is, the ability to work OOD \citep{pmlr-v139-koh21a, finlayson2021clinician}. To this end, we develop a method with guarantees on the classification error of the novel category that hold under a wide range of distribution shifts. Formal guarantees are particularly important in safety-critical applications, where they can help substantiate the trust and confidence of users and regulators. Many approaches have been devised for the detection of novel categories, mainly in the Open World learning literature \citep[e.g.,][]{Busto_2017_ICCV, han2019learning, xu2019open}. These methods are often applied in complex scenarios, yet give little to no theoretical guarantees. On the other hand, methods that are rigorously justified either apply to scenarios without distribution shifts, which are substantially simpler than the setting we pursue here \citep{blanchard2010semi, garg22adaptation, liu2018open}, or are less suitable for novel category detection \citep{he2018instance}. We make the following contributions:
\begin{itemize}[leftmargin=*]
    \item We propose a new learning algorithm for the problem of OOD novel category detection (i.e., novel category detection when the baseline distribution shifts). The method builds on approaches for constrained learning \citep{eban2017scalable, pmlr-v80-agarwal18a, chamon2022constrained, cotter2019training, donini2018empirical} and seeks to maximize the number of points correctly detected as novel, while keeping the rate of false detections below a prescribed level.
    \item We provide guarantees on the error of the learned model that hold under a single assumption, namely, that rare events in past data have bounded frequency under the new distribution. Prior works either provide much weaker guarantees or rely on stringent assumptions. Works that study the label-shift scenario assume that the only change is in the frequency of known and labelled subgroups \citep{garg22adaptation, shanmugam2021quantifying}. In our healthcare scenario, such methods require defining all possible patient subgroups that can shift, labelling the membership of patients in them, and accurately estimating the change in their frequency. Methods based on this strong assumption can also become impractical given the tedious labelling and the amount of data required. Other approaches require access to perfectly accurate density ratios between the distributions of past and current data (or to propensity scores, which cannot necessarily be estimated from data) \citep{bekker2019beyond, gerych2022recovering, jain2020class}. This requirement limits both their theoretical guarantees and their performance in many settings, for instance those involving high-dimensional data, where density ratio estimation is challenging \citep[Chapter~8]{sugiyama2012density}.
    
    \item Finally, we show favorable performance of the algorithm on challenging novel category detection tasks that we simulate over real-world datasets.
\end{itemize}
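
As a rough illustration of the first contribution, the rate-constrained objective can be sketched as follows. The notation is an assumption made for this sketch only and is not the formal setup of the paper: $P_{\mathrm{old}}$ and $P_{\mathrm{new}}$ denote the distributions of past and new data, $h : \mathcal{X} \to \{0,1\}$ is a detector from a hypothesis class $\mathcal{H}$ with $h(x) = 1$ meaning that $x$ is flagged as novel, and $\beta \in (0,1)$ is a user-chosen tolerance on false detections.
\[
\max_{h \in \mathcal{H}} \; \Pr_{x \sim P_{\mathrm{new}}}\left[ h(x) = 1 \right]
\quad \mbox{subject to} \quad
\Pr_{x \sim P_{\mathrm{old}}}\left[ h(x) = 1 \right] \le \beta .
\]
If past data is assumed to contain no novel points, every detection on it is a false one, so the constraint can be estimated from past data alone. How closely the objective tracks correct detections on new data depends on how the two distributions are related, which is where an assumption such as the bounded frequency of rare events would enter.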
