\section{Additional Details on Experimental Results} \label{app:additional_results}
In this section we provide additional details on our experiments. We start with details about relative AU-ROC and Av.-Precision in cases where {\ours} is not the best-performing method. On MIMIC-III, {\ours} obtains a mean relative AU-ROC (denoted by $\text{AU-ROC}/\text{AU-ROC}_{\text{best}}$) of $0.997$ in the $3$ rounds where it is not the best method. For comparison, the propensity baseline achieves a mean relative AU-ROC of $0.953$ in rounds where it is not the best method. Similarly, for relative Av.-Precision, the propensity baseline achieves a mean value of $0.844$ in rounds where it is not the best-performing method, while {\ours} has a mean of $0.964$ in the rounds where it is not the best method.
For Tabula-Muris, the relative AU-ROC in rounds where a method is not the best performing is similar for all methods. In relative Av.-Precision, {\ours} loses only one round, where it remains comparable to the best-performing method, achieving a relative Av.-Precision of $0.995$. The losses for other methods are by a far larger margin, as implied by \cref{tbl:results_main}.
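As a concrete instance of the relative metric, consider round $4$ on MIMIC-III in \cref{tab:mimic_details}, where the propensity baseline attains the best AU-ROC ($0.789$) and {\ours} obtains $0.787$. Using these rounded table entries (so the result is only illustrative),
\[
    \frac{\text{AU-ROC}}{\text{AU-ROC}_{\text{best}}} = \frac{0.787}{0.789} \approx 0.997 ,
\]
so a value close to $1$ indicates that the method is nearly tied with the winner of that round.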

The rest of this section begins by describing how we generate distribution shifts, continues with implementation details, and finally provides a few additional analyses.

\textbf{Generation of distribution shifts.} As explained in \cref{sec:experiments}, from the collection of available labels in the dataset $\gY$, one label is taken as the novel subgroup $y_{\text{novel}}$. Then for each label $y\in{\gY\setminus{\{y_{\text{novel}}\}}}$, denoting by $I_y=\{i: y_i = 1\}$ the examples with label $y$, we draw a number $\gamma_y$ uniformly from $[0.1, 1]$ and put $\gamma_y\cdot |I_y|$ randomly drawn examples with label $y$ in $S_{\gS}$. The other $(1-\gamma_y)\cdot |I_y|$ examples go in $S_{\gT}$. In MIMIC-III, the labels are phenotypes and each example (corresponding to a patient admission) can be assigned more than one label. In this case we iterate over the different labels in some order and create the shift for each one as described above, except that $I_y$ excludes indices of examples that were assigned a label appearing earlier in the iterative process. Before the iterative process begins, we also set aside all the examples belonging to the novel class, $\{\rvx_i\}_{i\in{I_{\text{novel}}}}$ where $I_{\text{novel}} = \{i: y_{\text{novel}} = 1\}$, and put them in $S_{\gT}$.
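To illustrate the split with hypothetical numbers (not taken from our experiments), suppose a label $y$ has $|I_y| = 1000$ examples and we draw $\gamma_y = 0.3$. Then
\[
    \gamma_y\cdot|I_y| = 300 \ \text{examples go to } S_{\gS},
    \qquad
    (1-\gamma_y)\cdot|I_y| = 700 \ \text{examples go to } S_{\gT},
\]
so labels with a small $\gamma_y$ are over-represented in the target sample relative to the source sample. When $\gamma_y\cdot|I_y|$ is not an integer, we assume in this illustration that it is rounded to one.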

Finally, we also draw validation sets $V_{\gS}, V_{\gT}$ out of $S_{\gS}$ and $S_{\gT}$ respectively, to be used for model selection and validation as described in the next part.

\textbf{Details on implementation and model selection.}
In our experiments we use a multilayer perceptron with $2$ hidden layers for Tabula-Muris (feature dimension $2866$, $64$ hidden units in each layer), following \citep{cao2021concept}, and a linear model for MIMIC-III (feature dimension $714$), which is one of the methods used in \citep{harutyunyan2910multitask}. We note that the computational complexity of the algorithm depends on the implementation of the constrained optimization step (line $4$ in \cref{alg:conoc}); results for some methods are given in \citet{chamon2022constrained, cotter2019optimization}. To obtain the computational complexity of \ours{}, the running time of this step should be multiplied by $L$, the size of $\boldsymbol{\alpha}$. It is likely that this runtime can be reduced significantly by more efficient search methods for $\alpha$; we leave the exploration of such implementation improvements for \ours{} to future work.

The Domain Discriminator baseline, $h_{\text{disc}}$, is trained by minimizing the log-loss. For MIMIC-III we use the cross validated Logistic Regression method from sklearn \citep{scikit-learn}, while for Tabula Muris we train with Adam \citep{DBLP:journals/corr/KingmaB14} for $150$ epochs and select the weights at the end of the epoch where the model achieves highest accuracy (on classification of $V_{\gS}$ vs. $V_{\gT}$) over a held-out validation set.
The propensity-weighted baseline is trained in the same manner, except we use the following weighted loss from \citet{bekker2019beyond, gerych2022recovering}:
\begin{align}\label{eq:prop_weighted_loss}
    R^{\mathrm{log}}_{\gS, e}(h) &= n_{\gS}^{-1}\sum_{\rvx\in{\datasource}}{e(\rvx)^{-1}l_{\mathrm{log}}(h(\rvx), 0) + (1-e(\rvx)^{-1})l_{\mathrm{log}}(h(\rvx), 1)} \nonumber \\
    &+ n^{-1}_{\gT}\sum_{\rvx\in{\datatarget}}{l_{\mathrm{log}}(h(\rvx), 1)}.
\end{align}
Here $e(\rvx)$ is the propensity score, which we obtain from the output of the Domain Discriminator model, $h_{\text{disc}}$. Namely, it is the probability that $h_{\text{disc}}$ assigns to the example $\rvx$ being drawn from $\Psource$. We calibrate the Domain Discriminator model over the validation set using Platt scaling \citep{platt1999probabilistic} before retrieving $e(\rvx)$; this improves the propensity score estimation and also the downstream performance on the learning task. Finally, model selection for this baseline is the same as for $h_{\text{disc}}$, except that we use the weighted accuracy (i.e., \Cref{eq:prop_weighted_loss} with $l_{0-1}$ in place of $l_{\mathrm{log}}$) instead of the unweighted accuracy.
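For reference, Platt scaling in its standard form fits a one-dimensional logistic model on top of an uncalibrated score. A minimal sketch, where $s(\rvx)$ (the uncalibrated score of $h_{\text{disc}}$ for the source domain) and the scalars $a, b$ are notation introduced here for illustration only:
\[
    e(\rvx) = \frac{1}{1 + \exp\left(-\left(a\, s(\rvx) + b\right)\right)} ,
\]
with $a$ and $b$ fit by minimizing the log-loss on the validation examples. The exact parameterization used in a given implementation may differ.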

In both datasets, {\ours} is trained by alternating steps of Adam for the model parameters, and gradient descent for the Lagrange multiplier. Model selection for {\ours} is done by selecting weights at the end of the epoch where the recall, $\hat{\alpha}(h) = |V_{\gT}|^{-1}\sum_{\rvx\in{V_{\gT}}}{h(\rvx)}$, is highest and False Positive Rate, $|V_{\gS}|^{-1}\sum_{\rvx\in{V_{\gS}}}{h(\rvx)}$, is smaller than $\beta = 0.01$. We train models with several values of $\alpha$ and choose the final model using this criterion.
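One way to write this selection rule compactly, using notation introduced here only for illustration ($h_t$ for a candidate model at the end of epoch $t$, and $\widehat{\mathrm{FPR}}(h)$ for its false positive rate on the validation set), is
\[
    t^{*} \in \arg\max_{t \,:\, \widehat{\mathrm{FPR}}(h_t) < \beta} \hat{\alpha}(h_t),
    \qquad
    \widehat{\mathrm{FPR}}(h) = |V_{\gS}|^{-1}\sum_{\rvx\in{V_{\gS}}}{h(\rvx)} ,
\]
where we assume that $h(\rvx)\in\{0,1\}$ denotes the binary prediction and that the candidates $h_t$ range over the epochs of all runs with the different values of $\alpha$.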

For additional details on the implementation of methods, please consult our code, to be released \href{https://github.com/yowald/OOD-Novel-Category/tree/main}{here} upon publication.

\textbf{Mixture Proportion Estimation.} As mentioned in \cref{sec:experiments}, the outputs of $h_{\text{disc}}$ and the propensity-weighted risk minimizer are not good binary classifiers if we simply set their decision threshold at probability $0.5$ for $y=1$. Instead, we need to adjust this threshold using a Mixture Proportion Estimation (MPE) method. We use methods from \citet{elkan2008learning, li2003learning}, denoted by $EN$ and $FPR < 0.1$ respectively.
The first estimator is designed under the assumption that $\Psource=\Plabel{0}$, while the second one is included since it follows the model selection principle we use in our method of thresholding the FPR.
To report the MPE for {\ours} we simply use $\hat{\alpha}(h)$, the fraction of positive labels predicted on the validation set from the target distribution.

For the same reasons mentioned in \cref{sec:experiments}, the metric we use for evaluation is a relative metric. We denote the estimated mixture proportion by $\hat{\alpha}$, the true proportion by $\alpha$, and use a quantity we call Relative Absolute Mixture Proportion Error (RAMPE), $|1-\frac{\hat{\alpha}}{\alpha}|$. E.g., if the novel class comprises $4\%$ of the population, and our approximation is $1\%$, the RAMPE is $0.75$.
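Written out, the example above reads
\[
    \left|1-\frac{\hat{\alpha}}{\alpha}\right| = \left|1-\frac{0.01}{0.04}\right| = |1 - 0.25| = 0.75 .
\]
Note that the metric is at most $1$ for underestimates but is unbounded for overestimates: since $\hat{\alpha}\geq 0$, the RAMPE exceeds $1$ exactly when $\hat{\alpha} > 2\alpha$. For instance, a hypothetical estimate of $\hat{\alpha} = 0.12$ with $\alpha = 0.04$ gives $|1 - 3| = 2$.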

As seen in \cref{fig:mpes}, the combination of the estimator from \citet{elkan2008learning} and the domain discriminator gives the best performance for MIMIC-III (note that the domain discriminator is the worst in terms of AU-ROC and Av.-Precision according to \cref{tbl:results_main} of the main paper). However, this estimator is very inaccurate for the Tabula Muris dataset. Occurrences of such large errors may be expected, as the estimator is designed under the assumption that $\Psource=\Plabel{0}$. Hence, while it may happen to provide a reasonable estimate at times, it can have very large errors at others.
These results suggest that in terms of estimating the mixture proportion, no single combination of baseline algorithm and MPE technique is preferred for both datasets, while {\ours} performs comparably to the $FPR < 0.1$ estimator, which avoids the very large errors that estimators based on the selected completely at random (SCAR) assumption can incur.

\begin{figure}[t]
\centering
\begin{tabular}{| ll | cc |} 
        \toprule
        \multirow{3}{*}{Algorithm} &
        \multicolumn{3}{|c|}{RAMPE: $| 1 - \hat{\alpha} / \alpha| $ } \\
        & \multicolumn{3}{|c|}{}\\ %\multicolumn{3}{|c}{($\mathrm{sign}\left[ 1 - \hat{\alpha} / \alpha \right]$)} \\
         & \multicolumn{1}{|c}{MPE method} &  MIMIC-III & Tabula Muris \\ \midrule \midrule
       \multirow{2}{*}{Domain Disc.} & $EN$ & $\mathbf{0.28} \pm 0.18$ & $6.60 \pm 5.52$ \\
       & $FPR < 0.1$  & $0.55 \pm 0.12$ & $0.72 \pm 0.86$ \\  \midrule
       \multirow{2}{*}{Propensity} & $EN$  & $0.62 \pm 0.14$ & $6.58 \pm 5.45 $ \\
       & $FPR < 0.1$  & $0.54 \pm 0.12$ & $\mathbf{0.50} \pm 0.57$ \\ \midrule
        \multirow{2}{*}{\ours} & & \multirow{2}{*}{$0.44 \pm 0.11$} & \multirow{2}{*}{$0.76 \pm 0.99$} \\
         & & & \\ \bottomrule
\end{tabular}
\caption{Average Relative Absolute Mixture Proportion Error ($|1-\hat{\alpha}/\alpha|$) for evaluated methods, where $\hat{\alpha}$ is the estimated proportion and $\alpha$ is the true one.
The estimator derived in \citet{elkan2008learning}, under assumptions that do not hold in our setting of distribution shift, demonstrates unstable performance, while thresholding FPR values seems to offer comparable performance across all methods.}
\label{fig:mpes}
\end{figure}
\begin{figure}[!t]
\begin{minipage}[!t]{0.49\columnwidth}
    \centering
    \includegraphics[scale=0.46]{figures/toy_example_no_shift.png}
    % \includegraphics[width=10cm]
    % \caption{AAA}\label{fig:AAA}
\end{minipage}
% \hfill{}
\begin{minipage}[!t]{0.49\columnwidth}
    \centering
    \includegraphics[scale=0.46]{figures/toy_example_rocs_no_shift.png}
    %width=20cm, bb=0 0 1200 900
    % \includegraphics[width=0.4\linewidth]
    % \caption{BBB}\label{fig:BBB}
\end{minipage}
\caption{\textbf{(Left)} Toy example from \cref{sec:toy_example}, but without distribution shift. The learned models mostly differ in their bias terms, hence under an appropriate choice of the decision threshold (e.g. via the results of \citet{elkan2008learning}) both can detect the novel category successfully. Thus \ours{} performs on par with unconstrained approaches. \textbf{(Right)} The ROC curves of the two classifiers coincide, emphasizing their equivalence in terms of ability to detect the novel category.}
\label{fig:toy_example_no_shift}
\end{figure}

\textbf{Effect of distribution shift on performance of \ours{} vs. baselines.} Continuing our motivating example from \cref{sec:toy_example}, \cref{fig:toy_example_no_shift} shows how the methods compare under the same categories, but when there is no distribution shift, i.e. $\Psource=\Plabel{0}$. As expected, the methods learn weights that are very close to one another, but differ in their bias terms. Hence in terms of detection abilities for the novel class they are equivalent. That is, under an appropriate setting of the decision threshold, both models will detect the novel class.

\textbf{Details on values of $\alpha$ and raw AU-ROC values.}
We include raw AU-ROC and AU-PRC values for all repetitions of our experiments. \Cref{tab:tabula_muris_details} gives the details for the Tabula-Muris experiments. We observe that in repetitions where $\alpha$ is large, the gap between \ours{} and the baselines is somewhat smaller. This is intuitive: when the addition of the novel category makes up most of the distribution shift between $\Psource$ and $\Ptarget$, we arrive at a case that is somewhat similar to our synthetic example in \cref{fig:toy_example_no_shift}, where \ours{} coincides with a domain discriminator. However, in all repetitions \ours{} has either the best AU-ROC or the best AU-PRC, and in $5$ out of $8$ runs it is best on both metrics. We note that for very small classes (i.e., smaller than $0.002$), all methods perform poorly and we do not include such novel categories in our experiments.

\begin{table}[t]
\centering
\begin{tabular}{|l|c|c|c|c|c|c|c|c|}
\hline
Trial index & 1 & 2 & 3 & 4 & 5 & 6 & 7 & 8 \\
\hline
AU-ROC Domain Disc. & 0.904 & 0.891 & 0.774 & 0.784 & 0.741 & 0.931 & 0.995 & 0.932 \\
AU-ROC Propensity & 0.885 & 0.884 & \textbf{0.790} & 0.730 & \textbf{0.834} & 0.905 & 0.986 & 0.919 \\
AU-ROC \ours{} & \textbf{0.914} & \textbf{0.947} & 0.755 & \textbf{0.860} & 0.791 & \textbf{0.958} &  \textbf{0.996} & \textbf{0.978} \\
\hline
AU-PRC Domain Disc. & 0.656 & 0.268 & 0.324 & 0.217 & 0.450 &  0.835 & 0.994 & \textbf{0.858} \\
AU-PRC Propensity & 0.234 & 0.088 & 0.152 & 0.058 & 0.211 & 0.572 & 0.900 & 0.458 \\
AU-PRC \ours{} & \textbf{0.764} & \textbf{0.505} & \textbf{0.410} & \textbf{0.400} & \textbf{0.570} & \textbf{0.906} & \textbf{0.995} & 0.854 \\ 
\hline
$\alpha$ & 0.0106 & 0.0181 & 0.0667 & 0.0459 & 0.0385 & 0.1527 & 0.2367 & 0.0517 \\
\hline
\end{tabular}
\caption{Raw AU-PRC and AU-ROC values and size of novel category, $\alpha$, for all runs on the Tabula-Muris dataset. \ours{} performs best in terms of both AU-ROC and AU-PRC in $5$ out of $8$ runs. The other runs do not have a distinct winning method, though \ours{} performs best in terms of either AU-ROC or AU-PRC on all runs.}
\label{tab:tabula_muris_details}
\end{table}

In MIMIC-III we use one phenotype as the novel category and draw different distribution shifts on each repetition. Hence the size of the novel category does not vary much: it is $\alpha = 0.075 \pm 0.002$. The raw values of the AU-ROC and AU-PRC can still change considerably, since different shifts entail different detection abilities for all the methods. \Cref{tab:mimic_details} gives the detailed results for these runs. We observe that the performance of all methods changes in unison according to the drawn shifts (as explained earlier, some shifts entail more difficult problems than others), but in relative performance \ours{} performs best on most repetitions.

\begin{table}[h]
\centering
\begin{tabular}{|l|c|c|c|c|c|c|c|c|}
\hline
AU-ROC & 1 & 2 & 3 & 4 & 5 & 6 & 7 & 8 \\
\hline
Domain Disc. & 0.757 & 0.825 & 0.837 & 0.753 & 0.812 & 0.841 & 0.763 & 0.740 \\
Propensity & 0.771 & 0.832 & 0.840 & \textbf{0.789} & 0.850 & 0.821 & 0.795 & 0.777 \\
\ours{} & \textbf{0.837} & \textbf{0.857} & \textbf{0.848} & 0.787 & \textbf{0.858} & \textbf{0.867} & \textbf{0.829} & \textbf{0.845} \\
\hline
\end{tabular} \\
\begin{tabular}{|l|c|c|c|c|c|c|c|c|}
\hline
AU-PRC & 1 & 2 & 3 & 4 & 5 & 6 & 7 & 8 \\
\hline
Domain Disc. & 0.275 & 0.412 & 0.389 & 0.262 & 0.348 & 0.374 &  0.295 & 0.291 \\
Propensity & 0.299 & 0.401 & 0.389 & \textbf{0.314} & 0.395 & 0.367 & 0.340 & 0.324 \\
\ours{} & \textbf{0.402} & \textbf{0.461} & \textbf{0.433} & 0.295 & \textbf{0.422} & \textbf{0.432} & \textbf{0.401} & \textbf{0.396} \\
\hline
\end{tabular} \\
\begin{tabular}{|l|c|c|c|c|c|c|c|}
\hline
AU-ROC & 9 & 10 & 11 & 12 & 13 & 14 & 15 \\
\hline
Domain Disc. & 0.798 & 0.813 & 0.760 & \textbf{0.819} & 0.741 & 0.831 & 0.782 \\
Propensity & 0.819 & \textbf{0.856} & 0.803 & 0.755 & 0.767 & 0.838 & 0.798 \\
\ours{} & \textbf{0.857} & 0.855 & \textbf{0.840} & 0.817 & \textbf{0.838} & \textbf{0.857} & \textbf{0.829} \\
\hline
\end{tabular} \\
\begin{tabular}{|l|c|c|c|c|c|c|c|}
\hline
AU-PRC & 9 & 10 & 11 & 12 & 13 & 14 & 15 \\
\hline
Domain Disc. & 0.351 & 0.350 & 0.257 & \textbf{0.322} & 0.260 & 0.420 & 0.246 \\
Propensity & 0.402 & 0.411 & 0.314 & 0.270 & 0.297 & 0.424 & 0.261 \\
\ours{} & \textbf{0.465} & \textbf{0.476} & \textbf{0.382} & 0.319 & \textbf{0.396} & \textbf{0.462} & \textbf{0.325} \\
\hline
\end{tabular}
\caption{Raw AU-PRC and AU-ROC values for all $15$ runs on the MIMIC-III dataset. \ours{} attains the best AU-ROC in $12$ out of $15$ runs and the best AU-PRC in $13$ out of $15$ runs; in the $3$ runs where it does not attain the best AU-ROC, it is within $0.002$ of the best method.}
\label{tab:mimic_details}
\end{table}
% \newpage
\textbf{Effect of $\beta$ on performance of {\ours}.} We use the Tabula-Muris dataset to examine the effect of choosing different values of $\beta$ in our procedure. To do so, we take the training history from running \cref{alg:conoc} and change the model selection in the last step of the algorithm to use a different value of $\beta$. Hence, by varying the value of $\beta$ we select different models. \Cref{fig:beta_examination} shows the relative AU-ROC and Av.-Precision as we vary $\beta$ between low and high values. It is important to note that the problem here is separable, in the sense that classifiers trained with the true labels for the novel class achieve AU-ROC values roughly in the range of $0.97$ to $0.99$. Hence low values of $\beta$ are expected to produce favorable results, as confirmed by the figure. As we move towards larger values of $\beta$, the metrics become noisier and also comparable to the baseline (we do not show the propensity estimation baseline since it has inferior performance on this dataset). It is also worth mentioning that increasing $\beta$ only affects the performance of {\ours}, hence the change in relative performance is due only to variation in the performance of our method. Our conclusion is that while the method is robust to the choice of $\beta$, large deviations from the ideal selection $\beta(h^*)$ will result in degraded performance.
\begin{figure}[t]
\centering
\includegraphics[width=0.4\textwidth]{figures/beta_effect_au_roc.png}
\includegraphics[width=0.4\textwidth]{figures/beta_effect_av_prec.png}
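\caption{Relative AU-ROC (left) and relative Av.-Precision (right) on the Tabula-Muris dataset as the value of $\beta$ used for model selection in {\ours} varies.}
\label{fig:beta_examination}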
\end{figure}
