\section{\ours: A Constrained Learning Method for OOD Novel Category Detection} \label{sec:algorithm}
%Solving \Cref{eq:precision_at_recall} with flexible hypothesis classes such as neural networks and decision trees seems like a challenging problem. 
Most computationally efficient gradient methods, and most classical ML theory results on statistical efficiency, apply to standard risk minimization problems, so applying them to \Cref{eq:precision_at_recall} is not straightforward. Fortunately, recent literature on fairness and constrained learning presents effective tools and beautiful theory to tackle this type of problem \citep{eban2017scalable, chamon2022constrained, cotter2019optimization, cotter2019training, donini2018empirical, pmlr-v80-agarwal18a, pmlr-v65-woodworth17a}. In this section we adapt these methods and insights to our novelty detection problem and arrive at a constrained learning approach that we call {\ours} (\textbf{Co}nstrained \textbf{No}vel \textbf{C}ategory detection).

% Define the empirical versions of $\beta(h), \alpha(h)$ w.r.t to a loss function $l:\sR\times \{0, 1\}\rightarrow \sR$ as:
% \begin{align*}
% \hat{\beta}^l(h) = \frac{1}{n_{\gS}}\sum_{\rvx\in{\datasource}}{l(h(\rvx), 0)}, \hat{\alpha}^l(h) = \frac{1}{n_{\gT}}\sum_{\rvx\in{\datatarget}}{l(h(\rvx), 1)}.
% \end{align*}
In terms of formal guarantees, constrained learning methods offer attractive bounds on the optimization error for solving \eqref{eq:precision_at_recall}, whereas \cref{thm:main_result} provides statistical guarantees. Plugging the optimization error terms directly into the bound of \cref{thm:main_result} yields error bounds on the complete procedure. Since combining these results does not require any novel insight or technique, we dedicate the rest of this section to the parts of the method we use in practice that depart from the algorithms discussed in the works above. Our implementation of a constrained learning optimization algorithm uses a simple primal-dual approach with alternating gradient steps, where one player controls the model parameters and the other controls a Lagrange multiplier for the rate constraint. Many further improvements and variations are possible, and we refer the interested reader to \citet{cotter2019optimization, cotter2019training, pmlr-v65-woodworth17a, chamon2022constrained, pmlr-v80-agarwal18a} for details on a variety of optimization algorithms and their guarantees.

\subsection{Detecting Novel Categories in Practice}
% To retrieve approximate solutions to the empirical optimization problem, we use Lagrangian optimization methods from \citep{cotter2019optimization, chamon2022constrained}.
Empirically, we find that estimating the solution to \Cref{eq:precision_at_recall} directly with Lagrangian optimization delivers poor results. Intuitively, this happens since, for a loss function $l:\sR\times\{0, 1\}\rightarrow \sR$, maximizing
$\hat{\alpha}^l(h) = \frac{1}{n_{\gT}}\sum_{\rvx\in{\datatarget}}{l(h(\rvx), 1)}$ fits noisy labels to mixed data. That is, the dataset $\datatarget$ contains both novel and non-novel points, and trying to fit as many of them as possible with $y=1$ (i.e., labelling them as novelties) results in overfitting. On the other hand, minimizing $\hat{\beta}^{l}(h)$ fits correct labels, since examples from $\datasource$ do not belong to the novel category, and we observe that this inhibits overfitting.
% Moreover, constraining $\hat{\beta}^{l_{01}}(h)$ to low error values is numerically unstable.
Hence we find that constraining $\hat{\alpha}^{l_{01}}(h)$ while minimizing $\hat{\beta}^l(h)$ works much better than a direct implementation of \cref{eq:precision_at_recall}, which constrains $\hat{\beta}^{l_{01}}(h)$ while maximizing $\hat{\alpha}^{l}(h)$. As we explain shortly, we solve this flipped problem for several values of the constraint on $\hat{\alpha}^{l_{01}}(h)$.

To obtain solutions for an optimization problem of the form 
\begin{align} \label{eq:flipped_opt}
    &\min_{h\in{\gH}}{\hat{\beta}(h)} \\
    &\text{s.t. }\hat{\alpha}(h) \geq \tilde{\alpha}, \nonumber
\end{align}
where $\tilde{\alpha}>0$ is some threshold on the empirical recall, define the Lagrangian with multiplier $\lambda \geq 0$
\begin{align*}
    \gL_{\tilde{\alpha}}(h, \lambda, \datasource, \datatarget) = &{n^{-1}_{\gS}}\sum_{\rvx\in{\datasource}}{l_{\mathrm{log}}(h(\rvx), 0)} \\
    &+ \lambda\cdot \left[ \tilde{\alpha} - {n^{-1}_{\gT}}\sum_{\rvx\in{\datatarget}}{l_{\sigma}(h(\rvx))} \right].
\end{align*}
We replace the $0$-$1$ loss over $\datasource$ with a surrogate log-loss (denoted by $l_{\mathrm{log}}$), and the $0$-$1$ loss in the constraint with a sigmoid (denoted by $l_{\sigma}$), which past work found to be effective as a differentiable approximation of the indicator function in several problems, including rate-constrained optimization \citep{chamon2022constrained, goh2016satisfying, maddison2017the, jang2017categorical}. We optimize this Lagrangian with alternating gradient steps over the parameters of $h$ and over $\lambda$.
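As a concrete illustration of these alternating steps, one common instantiation is gradient descent on the model parameters followed by projected gradient ascent on the multiplier. The sketch below is an assumption-laden example rather than a specification of our implementation: it assumes that $h = h_{\theta}$ is parameterized by a vector $\theta$, that $\lambda \geq 0$ multiplies the constraint violation $\tilde{\alpha} - n^{-1}_{\gT}\sum_{\rvx\in{\datatarget}}{l_{\sigma}(h_{\theta}(\rvx))}$, and that $\eta_{\theta}, \eta_{\lambda} > 0$ are step sizes. The symbols $\theta$, $\eta_{\theta}$, $\eta_{\lambda}$ and the iteration index $t$ are introduced only for this sketch.
\begin{align*}
    \theta_{t+1} &= \theta_{t} - \eta_{\theta} \nabla_{\theta}\Big[ {n^{-1}_{\gS}}\sum_{\rvx\in{\datasource}}{l_{\mathrm{log}}(h_{\theta}(\rvx), 0)} - \lambda_{t}\, {n^{-1}_{\gT}}\sum_{\rvx\in{\datatarget}}{l_{\sigma}(h_{\theta}(\rvx))} \Big]\Big|_{\theta=\theta_{t}}, \\
    \lambda_{t+1} &= \max\Big\{ 0,\; \lambda_{t} + \eta_{\lambda}\Big[ \tilde{\alpha} - {n^{-1}_{\gT}}\sum_{\rvx\in{\datatarget}}{l_{\sigma}(h_{\theta_{t+1}}(\rvx))} \Big] \Big\}.
\end{align*}
Under this convention the multiplier grows while the smoothed recall is below $\tilde{\alpha}$, which increases the weight on the recall term in the next parameter step, and it shrinks toward zero once the constraint is satisfied.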
% {\color{cyan} discuss details of Lagrangian Optimization, perhaps add an algorithm box and a reference to a formal statement about the algorithms in prior work. Consider consolidating all the claims into one main statement about the entire procedure.}

In summary, beyond the Lagrangian optimization procedure, our proposal for a practical algorithm includes two important components. The first is a line search on the value of $\alpha$, where in practice we simply solve problems with several values $\tilde{\alpha}$ that constrain $\hat{\alpha}(h)$ in \cref{eq:flipped_opt}. The second is model selection using a validation set. For each learned model, we use the validation set to approximate its error rate on $\Psource$ and its recall w.r.t.\ $\Ptarget$ (treating the target data as positively labelled).
We then select, among the hypotheses whose empirical error on $\Psource$, $\hat{\beta}(h)$, does not exceed the user-provided value $\beta$, the hypothesis $h$ that achieves the highest empirical recall $\hat{\alpha}(h)$. Hence our model selection is dictated by \Cref{eq:precision_at_recall}, which is the overall objective of our algorithm.\footnote{Note that in principle, if we consider $h^*$ that solves \cref{eq:precision_at_recall}, and $h^{\text{dual}}$ that solves \cref{eq:flipped_opt} with $\tilde{\alpha}$ set to $\alpha(h^*)$, then we can show that $h^{\text{dual}}$ is also optimal for \cref{eq:precision_at_recall}. Hence our procedure is indeed an approximate solution to \cref{eq:precision_at_recall}.} The procedure is summarized in \Cref{alg:conoc}.
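
The optimality claim in the footnote above can be sketched in a few lines. The sketch assumes that \cref{eq:precision_at_recall} has the form $\max_{h\in{\gH}}{\alpha(h)}$ subject to $\beta(h) \leq \beta$, that \cref{eq:flipped_opt} is taken at the population level with threshold $\alpha(h^*)$, and that both problems attain their optima. Since $h^*$ trivially satisfies the recall constraint $\alpha(h) \geq \alpha(h^*)$, it is feasible for the flipped problem, so the minimizer $h^{\text{dual}}$ satisfies
\begin{align*}
    \beta(h^{\text{dual}}) \leq \beta(h^*) \leq \beta, \qquad \alpha(h^{\text{dual}}) \geq \alpha(h^*).
\end{align*}
The first chain shows that $h^{\text{dual}}$ is feasible for \cref{eq:precision_at_recall}, and the second shows that its recall is at least the optimal value, so under these assumptions $h^{\text{dual}}$ is optimal as well.
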
% Several components in the procedure, e.g. the specific Lagrangian optimization method and exact model selection criterion can be implemented in several ways, and we discuss some of them in \Cref{sec:discussion}.
We now turn to evaluating the performance of our method.

\begin{algorithm}[t]
\onehalfspacing
\caption{\ours: Constrained Learning for OOD Novel Category Detection}
\label{alg:conoc}
\begin{algorithmic}[1]
\STATE {\bfseries Input:} datasets $\datasource, \datatarget$, hypothesis class $\gH$, target FPR $\beta > 0$ and search range $\boldsymbol{\alpha}\in{[0,1]^L}$.
\STATE Draw validation sets $V_{\gS}, V_{\gT}$ from $\datasource, \datatarget$, respectively.
\FOR{$\alpha\in{\boldsymbol{\alpha}}$}
    \STATE Train model $h_{\alpha}$ to solve \cref{eq:flipped_opt} with $\tilde{\alpha} = \alpha$ using primal-dual optimization.
    \STATE Calculate approx. FPR $\hat{\beta}(h_{\alpha}) = \frac{1}{|V_{\gS}|} \sum_{\rvx\in{V_{\gS}}}{h_{\alpha}(\rvx)}$, and recall $\hat{\alpha}(h_{\alpha}) = \frac{1}{|V_{\gT}|} \sum_{\rvx\in{V_{\gT}}}{h_{\alpha}(\rvx)}$.
    % thresholded hypothesis $h_{\alpha, t}(\rvx) = \1_{h(\rvx) > t}$ s.t. $P(\beta(h_{\alpha, t}) > \beta) \leq \delta$, using interval calculated w.r.t $V_{\gS}, V_{\gT}$.
\ENDFOR
\STATE {\bfseries Return:} $\mathrm{arg}\max_{h_{\alpha}: \alpha \in{\boldsymbol{\alpha}}, \hat{\beta}(h_{\alpha}) \leq \beta}{\hat{\alpha}(h_{\alpha})}$ %\sum_{\rvx\in{V_{\gT}}}{h_{\alpha}(\rvx)}
% {\color{cyan} Question for experimental part: should we ablate this model selection part and the Lagrangian training?}
\end{algorithmic}
\end{algorithm}

% In what follows we develop our constrained learning approach for the novel class detection problem under distribution shift. We begin by presenting a bound on the classification error of the new class, and a distributional assumption which guarantees the bound is a good approximation for the true error **TODO: reference relevant equation here**. The assumption boils down to asking that ``rare events" in $\Psource$ (defined by a threshold on probabilities of the event) cannot become highly probable in $\Ptarget$. Motivated by this bound, we aim to learn models that achieve high recall $\hat{\alpha}$ while maintaining some low pre-specified False-Positive Rate (FPR) on the problem of separating $\datasource$ and $\datatarget$. Once this goal is established, we present the constrained learning approach to the problem which we dub \ours (\textbf{Co}nstrained \textbf{No}vel \textbf{C}lass detection), and derive the appropriate generalization bounds that depend on the complexity of our hypothesis class $\gH$. We note that it would be challenging to draw such bounds for methods based on density-ratio estimation, as sample complexity for density ratios typically depend on the dimension of the features.


% and show how the mixture proportion can be estimated based on a Best Bin Estimation procedure similar to that of \citet{garg2021mixture}.

% \subsection{Necessary and Sufficient Assumptions for Learning}\label{sec:dist_assum}
% It is easy to see that no meaningful guarantee on the error of a learning algorithm can be derived for our problem, without making distributional assumptions that restrict $\Plabel{1}$ to $\Plabel{0}$. Consider a toy problem with distributions over $3$ states, where $\Psource = [1-\epsilon, \epsilon, 0]$ and $\alpha=0.5$. Let us examine two cases for the ground-truth data generating process, one given by $\Plabel{0} = [0, 1-\epsilon, \epsilon], \Plabel{1} = [0, \epsilon, 1-\epsilon]$ for some small $\epsilon > 0$, and the other where we switch the labels $\tilde{P}_{\gT, 0} = [0, \epsilon, 1-\epsilon], \tilde{P}_{\gT, 1} = [0, 1-\epsilon, \epsilon]$. Even in the ideal case where a learning algorithm has access to the true $\Psource$ and $\Ptarget$, and not just a finite sample, it cannot guarantee accuracy above . This means $\Ptarget = \alpha\Plabel{0} + (1-\alpha)\Plabel{1} = \alpha\tilde{P}_{\gT, 0} + (1-\alpha)\tilde{P}_{\gT, 1}$, hence a PU-learning algorithm that receives $\Psource, \Ptarget$ gets the same input in both scenarios. On the other hand, it is easy to see that any hypothesis $h$ that achieves small error when the true distributions are $\Plabel{0}, \Plabel{1}$ (e.g. that returns $1$ for the third state and $0$ for the others, achieving $R_{\gT}(h) = \epsilon$), thus solving the problem with small error. On the other hand, if we switch the label in the target distribution and let $\Plabel{0} = [0, \epsilon, 1-\epsilon], \Plabel{1} = [0, 1-\epsilon, \epsilon]$, then the same hypothesis achieves error $1-\epsilon$. It easy to see that for any hypothesis that achieves low error on the first problem, will obtain high error for the second. Show that switching the roles of the $Y=0$ and $Y=1$ leads to the same input to the learner, and hence the problem is unidentifiable
% Established assumptions in the literature on our problem are separability

% Before delving into the related literature, we should note that to give any meaningful result about the identifiability of $\alpha$ and of the novel class, some assumptions must be made. The main assumption we must make has to relate $P_B$ and $P_A$ in some manner. The only thing we can tell about $P_B$ in the absence of the SCAR assumption is that it is a mixture of the new class and \emph{some} distribution $P_0$ (i.e. $P_B = (1-\alpha) P_0 + \alpha P_1$, and $P_0\neq P_A$). Then perhaps the most generic way to relate these distributions is by common membership in a known uncertainty set of distributions $\gP$ (i.e. $P_0, P_A \in{\gP}$). Thus depending on our definition of $\gP$ we can recover several types of reasonable assumptions. In this paper we will mainly be interested in the case where $P_A \in \mathrm{relint}{\gP}$ (i.e. $P_A$ has non-zero mass wherever some $P\in{\gP}$ has non-zero mass) and the support of $P_1$ does not overlap with the support of any distribution in $\gP$. This will let us recover the novel group under very weak assumptions on $\mathrm{relint}{\gP}$ and hence gives a widely applicable algorithm. Even though this is a rather strong assumption, we will show that the methods we develop give meaningful results in different cases where the assumption does not hold. In \Cref{sec:beyond_no_overlap} we will discuss how the results of our proposed method should be interpreted in practice, and how future work can move beyond the settings we discuss here.

% Describe algorithm, explain constrained learning, give generalization bound on maximizing precision at recall.