\section{Experiments} \label{sec:experiments}
% \begin{figure*}[t]
% \centering
% % % \begin{minipage}{10cm} 
% % \begin{table}
% \begin{tabular}{l | cc | cc} \toprule 
%         \multirow{3}{*}{Algorithm} & \multicolumn{2}{c}{Tabula Muris} & \multicolumn{2}{c}{MIMIC-III} \\
%         & $\Delta_{\text{best}}$ AU-ROC & Rel-AMPE $\left( | 1 - \hat{\alpha} / \alpha^*| \right)$ & $\Delta_{\text{best}}$ AU-ROC & RAE-MPE\\
%         & (wins / reps.) & Rel-MPE $\left( 1 - \hat{\alpha} / \alpha^* \right)$ & (wins / reps.) & $1 - \hat{\alpha} / \alpha^*$ \\ \midrule \midrule
%         \multirow{2}{*}{Disc. + Elkan-Noto} & & & 0.791 & \\
%         & & & & \\ \midrule
%         \multirow{2}{*}{Propensity} & & & 0.807 & \\
%         & & & & \\ \midrule
%         \multirow{2}{*}{\ours} & & & 0.817 & {\color{cyan} 0.05} \\
%         & & & &
% \end{tabular}
% \caption{Average difference from best method in Area Under the Receiver-Operator Curve, AU-ROC $\Delta_{\text{best}}$, and Relative Absolute Error in Mixture Proportion Estimation (RAE-MPE) for evaluated methods. Difference from best method is reported instead of mean AUC since drawn distribution shifts differ. \color{cyan}{Bottom lines are standard deviations across different draws of data and novel classes. TODO: fill in after running repetitions}} \label{tbl:results_main}
% % \end{table}
% % \end{minipage}
% \end{figure*}

\begin{table*}[t]
\centering
\begin{tabular}{l | cc } \toprule 
        \multirow{3}{*}{Algorithm} & \multicolumn{2}{c}{$\text{AU-ROC}/\text{AU-ROC}_{\text{best}}$} \\
        & \multicolumn{2}{c}{(wins $|$ reps.)} \\
        & MIMIC-III & Tabula Muris \\ \midrule \midrule
        \multirow{2}{*}{Domain Disc.} & $0.940 \pm 0.035$ & $0.954 \pm 0.035$  \\
        & (1 $|$ 15) & (0 $|$ 8)  \\ \midrule
        \multirow{2}{*}{Propensity} & $0.959 \pm 0.028$ & $0.953 \pm 0.046$  \\
        & (2 $|$ 15) & (2 $|$ 8) \\ \midrule
        \multirow{2}{*}{\ours} & $\mathbf{0.999} \pm 0.001$ & $\mathbf{0.988} \pm 0.020$ \\
        & \textbf{(12 $|$ 15)} & \textbf{(6 $|$ 8)}
\end{tabular}
\begin{tabular}{| cc } \toprule 
         \multicolumn{2}{|c}{$\text{AU-PRC}/\text{AU-PRC}_{\text{best}}$} \\
        \multicolumn{2}{| c}{(wins $|$ reps.)} \\ %\multicolumn{3}{|c}{($\mathrm{sign}\left[ 1 - \hat{\alpha} / \alpha \right]$)} \\
        MIMIC-III & Tabula Muris \\ \midrule \midrule
        $0.797 \pm 0.098$ & $0.815 \pm 0.173$ \\
         (1 $|$ 15) & (1 $|$ 8)  \\  \midrule
       $0.854 \pm 0.064$ & $0.428 \pm 0.235$ \\
        (1 $|$ 15) & (0 $|$ 8) \\ \midrule
          $\mathbf{0.995} \pm 0.015$ & $\mathbf{0.999} \pm 0.001$ \\
          \textbf{(13 $|$ 15)}&  \textbf{(7 $|$ 8)} 
\end{tabular}
\caption{Average relative Area Under the Receiver Operating Characteristic curve, $\text{AU-ROC}/\text{AU-ROC}_{\text{best}}$, where at each repetition $\text{AU-ROC}_{\text{best}}$ is the area attained by the best method and $\text{AU-ROC}$ is that of the evaluated method. We report performance relative to the best method instead of raw AU-ROC, since performance varies across the drawn distribution shifts. The relative Average Precision, $\text{AU-PRC}/\text{AU-PRC}_{\text{best}}$, is reported in the same manner to summarize the Precision-Recall curve. Parentheses give the number of repetitions in which the method performed best (wins) and the total number of repetitions (reps.).
} \label{tbl:results_main}
\end{table*}
We evaluate {\ours} on two large, high-dimensional, real-world datasets.

\textbf{Experimental Setting.} For each dataset we have features $S=\{\rvx_i\}_{i=1}^{N}$ that are available to the learner and labels $\{y_i\}_{i=1}^{N}$ that are not. These labels are used to set up the novel categories and distribution shifts in our experiments. The procedure for each experiment is as follows. From a set of possible labels $\mathcal{Y}$, we choose $y_{\text{novel}}\in{\gY}$ and collect all examples that belong to the group $\gI = \{i: y_i=y_{\text{novel}}\}$ into a dataset $S_{\text{novel}}=\{\rvx_i\}_{i\in{\gI}}$. This divides our data into two disjoint subsets: $S_{\text{novel}}$, containing the novel category, and $S_{\text{seen}} = S\setminus S_{\text{novel}}$, containing the rest of the examples.
We further split $S_{\text{seen}}$ into disjoint subsets $S_{\gS}$ and $S_{\gT,0}$, and create a subpopulation shift (see \cref{eq:varying_mixtures}) between these two subsets by randomly drawing the prevalence of each subgroup in $\gY \setminus \{y_{\text{novel}}\}$ (see \cref{app:additional_results} for further details).\footnote{In MIMIC-III, each example has multiple labels, hence notation here is slightly abused. This is also detailed in the appendix.} Each algorithm is then run with the datasets $S_{\gS}$ and $S_{\gT} = S_{\gT, 0} \cup S_{\text{novel}}$ as inputs (so the true mixture proportion is $\alpha = |S_{\text{novel}}| / (|S_{\gT,0}| + |S_{\text{novel}}|)$). We repeat this procedure, creating a different subpopulation shift each time.
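As a purely illustrative calculation (the sizes below are hypothetical and are not taken from our experiments), a target set with $|S_{\gT,0}| = 1900$ seen examples and $|S_{\text{novel}}| = 100$ novel examples gives
\begin{equation*}
\alpha = \frac{|S_{\text{novel}}|}{|S_{\gT,0}| + |S_{\text{novel}}|} = \frac{100}{1900 + 100} = 0.05 .
\end{equation*}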


% \vspace{-10pt}
\textbf{Baselines and evaluation metrics.} We compare {\ours} with the algorithm proposed by \citet{gerych2022recovering}, which is based on propensity weighting \citep{bekker2019beyond}: the idea is to estimate the density ratio of $\Psource$ and $\Ptarget$ and use it as importance weights (see \cref{app:additional_results} for details). We present results for this method because it outperforms other relevant baselines (e.g., the clustering-based approach of \citet{jain2020class}), and because other methods for biased PU-learning \citep{kato2018learning, he2018instance} rely on assumptions that do not hold in our setting. Our second baseline is a domain discriminator trained to distinguish between $\datasource$ and $\datatarget$, which forms the basis for many PU-learning techniques, e.g., \citep{elkan2008learning, duplessis2014analysis, garg2021mixture}.

To calculate metrics such as accuracy, precision and recall, we need binary predictions of whether examples belong to $y_{\text{novel}}$ or not. For both baselines, this requires an estimate of $\alpha$ (a mixture proportion estimate, MPE) that is incorporated into the classifier (e.g., by setting the appropriate decision threshold, see \citet[Sections~5.3, 6]{bekker2020learning}).
% Therefore to compare metrics such as accuracy, precision or recall with the baselines, we would need to approximate $\alpha$ and use it to adjust the classification threshold.
Since most MPE methods are designed under the assumption that $\Psource=\Plabel{0}$, and no single method is designed to perform well under a variety of distribution shifts, we evaluate methods with metrics of predictive ability that are independent of the decision threshold. In \cref{app:additional_results} we include MPE results for the baselines, obtained with two different techniques \citep{elkan2008learning, li2003learning} ({\ours} does not require MPE, since we simply use the raw outputs for classification). A second consideration in choosing evaluation metrics is that a different distribution shift is drawn at each repetition of the experiment, so the ability of models to distinguish $y_{\text{novel}}$ from the rest of the data can vary between repetitions. Comparing raw metrics is then less informative, and metrics measured relative to the other methods are more appropriate. Taken together, these considerations lead to the following metrics, reported in \cref{tbl:results_main}.

We use the Area Under the Receiver Operating Characteristic curve (AU-ROC) and the Average Precision (Av.-Precision, labeled AU-PRC in \cref{tbl:results_main}) as summaries of the ROC and Precision-Recall curves respectively, where the classification task is detection of
$y_{\text{novel}}$ vs. $\gY \setminus \{y_{\text{novel}}\}$. At each repetition we take the AU-ROC of the best performing method, denoted by $\text{AU-ROC}_{\text{best}}$, and for each method calculate $\text{AU-ROC} / \text{AU-ROC}_{\text{best}}$ to obtain a relative measure of performance (and likewise for Av.-Precision). We also report the number of repetitions in which each method performed best. The absolute $\text{AU-ROC}$ values for each repetition of the experiments are detailed in \cref{app:additional_results}.
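In symbols (the indices $r$ and $m$ and the names $\text{Rel}_m$ and $\text{wins}_m$ are introduced only for this summary and are not used elsewhere), let $\text{AU-ROC}_m^{(r)}$ denote the AU-ROC of method $m$ at repetition $r \in \{1, \ldots, R\}$. Assuming no ties for the best method, the reported quantities can be written as
\begin{align*}
\text{AU-ROC}_{\text{best}}^{(r)} &= \max_{m'} \, \text{AU-ROC}_{m'}^{(r)}, \\
\text{Rel}_m &= \frac{1}{R} \sum_{r=1}^{R} \frac{\text{AU-ROC}_m^{(r)}}{\text{AU-ROC}_{\text{best}}^{(r)}}, \\
\text{wins}_m &= \left| \left\{ r : \text{AU-ROC}_m^{(r)} = \text{AU-ROC}_{\text{best}}^{(r)} \right\} \right|,
\end{align*}
with analogous definitions for Av.-Precision. By construction each ratio lies in $[0,1]$, and a method that is best in every repetition attains a relative score of exactly $1$.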

\begin{comment}
{\color{cyan}{TODO: start from here tomorrow. Shortly describe MPE methods, explain they can work quite poorly under distribution shift. Say that this indeed happens in results, no single combo of MPE method and model gives the best proportion estimation. \ours does not require an extra step for MPE (we simply return the classifier's predictions), and in the data we worked with, they were comaprable to the best performing methods over the different settings. Copy explanation for using relative metrics instead of absolute ones.}}
For mixture proportion estimation, we apply two techniques on the classifiers trained by the baselines. A standard PU-learning estimator based on \citet{elkan2008learning}, and a rule we call $FPR > 0.01$ that sets the decision threshold where the False Positive Rate on the validation set is $0.01$. We include the latter since it is similar to our estimation technique, some similar selection rules have been used in the literature, e.g. \citet{li2003learning}. For \ours the estimator is the fraction of examples the classifier labels as positive in the validation target data.

To evaluate the methods we use two metrics, the Relative Area Under the Receiver-Operator Curve for detecting $y_{\text{novel}}$ vs. $\gY\setminus \{y_{\text{novel}}\}$ ($\text{AU-ROC} / \text{AU-ROC}_{\text{best}}$), and Relative Mixture Proportion estimation Error (RMPE). The latter is calculated by taking $1- \hat{\alpha} / \alpha$, where $\hat{\alpha}$ is the mixture proportion estimate for a model and $\alpha$ is the ground truth mixture proportion. Since at each repetition of the experiment a new dataset shift is drawn, the raw values of AU-ROC change considerably and they are not informative towards comparison between the methods. Therefore at each round we take the AU-ROC for the best performing method, denoted by $\text{AU-ROC}_{\text{best}}$, and for each method calculate $\text{AU-ROC} / \text{AU-ROC}_{\text{best}}$ to get a relative measure of performance. We report the average of this metric over all repetitions. The Relative AU-ROC captures the overall ability of the model to detect the new group, \footnote{AU-ROC is also the metric of choice for the MIMIC-III benchmark \citep{harutyunyan2910multitask}.} while the RMPE measures our ability to set a correct decision threshold for the classifier.
\end{comment}

\textbf{Datasets.} In the Tabula Muris single cell dataset \citep{tabula2020single}, the categories $\mathcal{Y}$ are cell types and the features $\gX$ are gene expressions. The shift between $S_{\gS}$ and $S_{\gT, 0}$ is therefore due to differing proportions of the observed cell types. This follows the experimental setting in \citet{garg22adaptation}, with the crucial difference that in our setting the learner does not observe cell types in $S_{\gS}$.
In the benchmark dataset devised by \citet{harutyunyan2910multitask} for MIMIC-III \citep{johnson2016mimic}, categories correspond to phenotypes (e.g., kidney disease, pneumonia, liver disease) and features are high-dimensional statistics extracted from time-series data, such as vitals and lab measurements, recorded over ICU stays (see \citet[Tables~2,3]{harutyunyan2910multitask} for the list of phenotypes and features considered). The proportion of novel categories $\alpha$ within $\datatarget$ in these experiments is between $0.005$ and $0.06$; more details on these values and on the effect of $\alpha$ on performance are in \cref{app:additional_results}.

\subsection{Results}
\Cref{tbl:results_main} shows that {\ours} performs favorably with respect to the baselines in terms of relative AU-ROC and Av.-Precision on both datasets. It is the best performing method in the vast majority of repeated experiments and a further examination of the results shows that when it is not, the gap in performance is very small (see \cref{app:additional_results} for more details).

% In terms of mixture proportion estimation, the best overall estimate is given by the method of \citet{elkan2008learning} with the standard logistic regression model. In these datasets, our choice of $\beta=0.01$ is a bit restrictive and even a classifier trained with true labels does not achieve a much lower false positive rate. Thus as expected from the discussion in \Cref{sec:guarantees}, the retrieved mixture proportion is a lower bound on the true proportion. We note that it is possible to use other mixture proportion estimation methods on top of the model trained by \ours, and we show these results in the appendix.

%\vspace{-12pt}
\textbf{Takeaways from experiments.} The results above demonstrate the effectiveness of constrained learning approaches in detecting novelties under distribution shift.
% while theory (see \Cref{sec:guarantees}) and further empirical analysis (see \Cref{app:additional_results}) suggest factors that should be taken into consideration when applying the method in new settings.
The main choice to be made when using {\ours} is the value of $\beta$, our approximation of the false positive rate of the optimal hypothesis. In our experiments, setting $\beta=0.01$ was sufficient to obtain favorable performance relative to the baselines, though we observe that it is not necessarily the optimal choice. For instance, in the Tabula Muris dataset, a lower setting of $\beta$ improves performance (details are given in \cref{app:additional_results}). We attribute this to our conservative over-estimation of $\beta(h^*)$: training an oracle classifier with the true labels of $y_{\text{novel}}$ gives a near-perfect predictor in terms of test accuracy (i.e., the problem is approximately separable). On the other hand, in MIMIC-III an oracle classifier does not achieve near-perfect accuracy, and decreasing $\beta$ does not improve results. This may be expected, as subgroups of patients, such as those with a certain phenotype, are diagnosed using additional features that are not available to the learner. Our conclusion is that while most reasonable, sufficiently small choices of $\beta$ lead to favorable performance relative to the baselines, reasoning about the expected $\beta(h^*)$ with domain knowledge can further improve the performance of {\ours}.

We note that in many works on robustness to distribution shifts, benchmark tasks are designed so that standard methods such as Empirical Risk Minimization fail (e.g., the Waterbirds and CelebA examples of \citet{Sagawa2020Distributionally}, or Colored MNIST of \citet{arjovsky2019invariant}). In contrast, our setting randomly assigns the prevalence of human-annotated subgroups (following \cref{eq:varying_mixtures}), hence the shifts are \emph{not} specially designed to create extreme or adversarial scenarios. This suggests that accounting for distribution shifts with our method might be beneficial in many cases, and that the results are not limited to carefully designed examples. That said, in \cref{app:additional_results} we demonstrate through further experiments on MIMIC-III and a synthetic example that when there is no distribution shift between $\Psource$ and $\Plabel{0}$, {\ours} does not improve over the baselines. We now conclude with a broad overview of the results and potential ways forward.
% while being slightly conservative (as evident from $\mathrm{sign}\left[ 1 - \hat{\alpha} / \alpha \right]$ in \Cref{tbl:results_main}). 
\begin{comment}
In \Cref{app:additional_results} we show how this choice can affect performance in different settings over the Tabula-Muris dataset. In general, smaller values of $\beta$ are beneficial in detecting subgroups with small mixture proportions that can be separated very well (some small classes in the dataset are examples of such occurences). When separation is not near perfect novel . In both cases, examining the ratio $\alpha(h) / \beta$ for the selected hypothesis $h$ under a fixed $\beta$, and choosing a value that results in a large ratio yields rather good performance. Yet generally, the choice of $\beta$ should be guided by domain expertise or auxiliary information, we expand on this slightly in the next section, where we turn to conclude this work.
\end{comment}