\section{Discussion, Limitations and Future Work} \label{sec:discussion}
We proposed a constrained learning approach for OOD novel category detection, based on a distributional assumption that bounds the shift in probability of rare events. A potential use of our method is in ML safety, where detecting novel groups that were not part of the historical data can alert practitioners to issues that require further analysis. This complements methods that detect other types of safety issues, such as error cases \citep{d2022spotlight, eyuboglu2022domino, singla2021understanding} and under-performing subgroups \citep{subbaswamy2021evaluating}, as well as OOD-detection methods that raise alerts on single examples rather than on classes \citep{ruff2021unifying}.

Our formal framework is based on PU-learning, and our method builds on advances in rate-constrained optimization. Early literature on the PU-learning problem (without distribution shift) recognizes that constrained optimization may be a useful approach, yet forgoes this path because it appears to pose a challenging optimization problem \citep{liu2002partially}; mixture proportion estimation based on trade-offs between recall and FPR has been explored more extensively \citep{blanchard2010semi, pmlr-v38-scott15, jain2016estimating, jain2016nonparametric}. It was later shown that unconstrained risk minimization techniques can be devised to solve PU-learning problems under the SCAR assumption \citep{elkan2008learning, duplessis2014analysis}, which seemed to make constrained optimization unnecessary. Our work argues that without the SCAR assumption, a constrained learning approach can be beneficial. Importantly, we show that for our constrained learning rule, formal guarantees can be derived in settings where, to the best of our knowledge, learnability in the sense of \cref{def:prob_setting} has not been shown.

Our approach has some limitations. The hyperparameter $\beta$ should be chosen carefully, and choosing it requires reasoning about properties of the groups we hope to detect. \cref{thm:main_result} provides guidance in cases where we have a good approximation of $\beta(h^*)$, for instance when we are willing to assume that separability (approximately) holds, which is a reasonable assumption in many applications. In our experiments, the performance of our method remains favorable relative to baselines when $\beta$ is not fine-tuned. This is encouraging, yet it does not prove that this observation generalizes to all real-world scenarios. Other aspects of \cref{alg:conoc} can likely be improved, such as replacing the line search over $\boldsymbol{\alpha}$ with other approaches to hyperparameter tuning, and experimenting with constrained optimization algorithms more sophisticated than the alternating primal-dual steps we use in our implementation.
% Some data-driven tools for the task of estimating $\beta$ can be further explored. For instance, using an additional dataset, $S_{\text{aux}}$, sampled from another distribution where distribution shifts occur (e.g. as in \Cref{eq:varying_mixtures}) but we know novel subgroups were not introduced. if we have an additional approximating the prevalence of rare events from the source data $\datasource$, or perhaps with an additional dataset $S_{\text{aux}}$ that is sampled from another distribution where distribution shifts occur (e.g. as in \Cref{eq:varying_mixtures}), but we know novel subgroups were not introduced.

% Other useful tools that can be developed are approximations to ratios such as $\alpha(h) / \beta(h)$ from finite samples, where intuitively, high ratios can be attributed to novelties.
% can also pose a limitation is the generality of our assumption, namely the bound on frequency of rare events (\cref{assum:unicorn_bound}). While the assumption is rather non-restrictive, which is an attractive property on the one hand,
\cref{assum:unicorn_bound} on the frequency of rare events is rather non-restrictive and is likely to hold in several cases of interest. On the other hand, its generality also means that it is not tailored to other types of distribution shift. For instance, recent works on PU-learning make structural assumptions on the distribution shift \citep{garg22adaptation, shanmugam2021quantifying} that are very different from ours and can be useful. Combining different types of assumptions into a rich framework for novelty detection under distribution shift is an exciting avenue for future research. Extensions to settings such as time series and multiple data sources are another promising direction. Recent works on invariance and stability under distribution shifts offer structural frameworks that would be interesting to explore in the context of novelty detection \citep{peters2016causal, arjovsky2019invariant, subbaswamy2019preventing, subbaswamy2021evaluating, puli2022outofdistribution, wald2021calibration}. We hope that this paper encourages further work on novelty detection in changing environments with performance guarantees.

% Several aspects of \ours can be further explored; for instance, the model selection criterion, which performs reasonably well in experiments, can likely be improved. Our selection criterion is based on a Clopper-Pearson interval that is rather sensitive to finite-sample effects, and tighter bounds on the recall $\alpha(h)$ can be useful in that sense.

% Some works on anomaly detection and PU-learning are related to the latter problem \citep{garg2021mixture, liu2018open} and it would be interesting to see if these techniques can be adapted to our goals.

% As for the limitation on adaptivity to different distribution shifts, recent works on PU-learning make structural assumptions on the distribution shift \citep{garg22adaptation, shanmugam2021quantifying} that are very different from ours, and combining the two types of assumptions into a rich framework for novelty detection under distribution shifts is an exciting avenue for future research. 