\section{Simulations}\label{sec:simulations}
In this section, we conduct a simulation study to test the performance of \sinabbr and \mulabbr under a wide range of settings. Our goal is to determine how the performance of \sinabbr and \mulabbr is affected by (i) the underlying data generation process (DGP), (ii) the degree of similarity between the rare and common outcomes, and (iii) the event rate of the common outcome. We outline the linear and non-linear DGPs used in these simulations, along with other setup details, in Section~\ref{sec: exp-setup}. We then explore the impact of varying event similarity in Section~\ref{sec: exp-vary-sim} and of varying the common outcome event rate in Section~\ref{sec: exp-vary-rate}. Our results provide a foundation for understanding the conditions under which CET methods provide additional benefit and set the stage for the use of these methods in real-world clinical applications (Section~\ref{sec:experiments}).%
\footnote{Code to run the experiments is available at \url{https://github.com/engelhard-lab/rare_event_mll}.}

\subsection{Setup}\label{sec: exp-setup}
Both our linear and non-linear DGPs simulate a setting in which 25 input patient features are used to predict a rare disease of interest, $y_{i, 1}$, whose underlying risk function is related to that of a more common disease, $y_{i, 2}$. The degree of similarity between the underlying risk functions for $y_{i, 1}$ and $y_{i, 2}$ is a controllable parameter passed as an argument to the DGP. For details on both the linear and non-linear DGP setups, see Appendix~\ref{apdx:exp-details-dgp}.

For all experiments using the linear DGP, we generate a training set of 15,000 synthetic patients. We increase the number of training samples to 75,000 for the non-linear DGP. In addition to the $L_2$ similarity penalty in Equations~\ref{eq:sin-ll} and \ref{eq:multi-ll}, we consider a corresponding $L_1$ variant as well as a version that uses cosine similarity.
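
As a rough sketch of how the three variants differ, suppose the penalty compares two label-specific weight vectors, written here as $w_1$ and $w_2$. This notation and the exact forms below are illustrative assumptions only; the $L_2$ definition actually used is the one in Equations~\ref{eq:sin-ll} and \ref{eq:multi-ll}. With penalty strength $s$, the $L_2$, $L_1$, and cosine variants could then take the forms
\[
  s\,\|w_1 - w_2\|_2^2, \qquad
  s\,\|w_1 - w_2\|_1, \qquad
  s\left(1 - \frac{w_1^\top w_2}{\|w_1\|_2\,\|w_2\|_2}\right).
\]
Under these assumed forms, the $L_1$ penalty tends to drive individual weight differences exactly to zero, whereas the cosine penalty constrains only the direction of the two weight vectors and not their magnitudes.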

We compare \sinabbr and \mulabbr to single-label learning trained exclusively on $y_{i, 1}$ and to standard MLL without a similarity penalty (\textit{i.e.}, the CET method with $s=0$). All methods use standard ridge (\textit{i.e.}, $L_2$) regularization to avoid overfitting. Both $s$ and the regularization strength parameter are optimized by grid search on a validation set. See Appendix~\ref{apdx:exp-details-training} for further training details.

We use a large test set of 35,000 synthetic patients to reduce the uncertainty in the evaluation stemming from the rarity of the outcomes. We adopt the evaluation method described in \cite{kmetzsch2022disease}, which uses Spearman's rank correlation ($\rho$) as the primary metric on simulated data to assess the agreement between predicted and true disease risk rankings. We include results using AUC as the evaluation metric in Appendix~\ref{apdx:sim}, as it serves as a proxy for $\rho$ in real-world settings where the true risk is unknown.
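
For reference, Spearman's $\rho$ is the Pearson correlation between two sets of ranks. In notation introduced here only for illustration, let $R_i$ be the rank of the true risk of test patient $i$ under the DGP and $\hat{R}_i$ the rank of the corresponding predicted risk among the $n$ test patients, with tied values assigned their average rank. Both sets of ranks then have mean $\bar{R} = (n+1)/2$, and
\[
  \rho = \frac{\sum_{i=1}^{n} (R_i - \bar{R})(\hat{R}_i - \bar{R})}
              {\sqrt{\sum_{i=1}^{n} (R_i - \bar{R})^2}\,\sqrt{\sum_{i=1}^{n} (\hat{R}_i - \bar{R})^2}},
\]
so that $\rho = 1$ when the predicted risks order patients exactly as the true risks do. Because $\rho$ depends only on ranks, it is unaffected by any strictly increasing transformation of the predicted risks.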

In Appendix~\ref{apdx:comparison}, we include additional results comparing CET approaches to Firth regression, gradient boosting, and transfer learning. These methods provide alternative baselines for rare event prediction and are discussed further in the Appendix.

\subsection{Varying Event Similarity}\label{sec: exp-vary-sim}

We first explore how the benefit of CET is affected by event similarity by varying the risk function similarity between the rare and common events from 0\% to 100\%. We set the expected event rate to 1\% for the rare disease and to 5\% for the more common disease. Simulations are run using both the linear and non-linear DGPs.
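
To give a sense of scale, the expected number of events is the sample size multiplied by the event rate. Using the training and test set sizes from Section~\ref{sec: exp-setup}, and treating the rates as expectations (realized counts will vary across iterations), the approximate event counts are
\[
\begin{array}{lrr}
 & \mbox{rare (1\%)} & \mbox{common (5\%)} \\
\mbox{linear DGP, training } (n = 15{,}000) & 150 & 750 \\
\mbox{non-linear DGP, training } (n = 75{,}000) & 750 & 3{,}750 \\
\mbox{test } (n = 35{,}000) & 350 & 1{,}750
\end{array}
\]
so a model trained on the rare outcome alone under the linear DGP sees only about 150 positive examples.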

Figure~\ref{fig:similarity}a shows how \sinabbr enhances rare disease prediction under the linear DGP at different similarity levels. All three variants of \sinabbr outperform the baseline single-label learning when event similarity exceeds a threshold of roughly 40\%, and the improvement grows as event similarity increases. Note that, in contrast to the non-linear setting (Figure~\ref{fig:similarity}b), standard MLL without a similarity penalty provides no benefit here, because without a similarity regularization term no information is shared between labels.

Figure~\ref{fig:similarity}b shows the results of the same experiment under the non-linear DGP, where the results for \mulabbr resemble those for \sinabbr under the linear DGP. In this setup, however, the standard MLL approach (\textit{i.e.}, no similarity penalty) does improve upon single-label learning, a result that is consistent with the literature on MLL with neural networks. The similarity penalty can nonetheless yield further improvement, specifically when there is sufficient overlap of the latent features and their corresponding weights (\textit{e.g.}, $\ge$40\%).

\begin{figure}[!htb]
  \centering
  \includegraphics[width=1\linewidth]{images/similarity}
  \caption{Boxplots of the pairwise improvement in rare disease prediction over single-label learning, based on 10 iterations for each event similarity setting, under (a) the linear DGP and (b) the non-linear DGP. The red line indicates the baseline of single-label learning for each iteration.}\label{fig:similarity}
\end{figure}

We further analyze the relationship between the similarity penalty parameter ($s$) selected via validation and the underlying event similarity in Appendix~\ref{apdx:sim_lam}. The positive correlation between the two supports the claim that the improved performance of the CET methods comes from the added similarity penalty.

\subsection{Varying Event Rate}\label{sec: exp-vary-rate}

We now explore how the benefit of CET is affected by the event rate of the more common event. To do so, we hold the event similarity and rare outcome event rate constant at 80\% and 1\%, respectively. We then vary the common outcome event rate from 1\% to 30\%. 

Figure~\ref{fig:event_rate_linear} shows that increasing the event rate of $y_{i, 2}$ not only improves predictive performance for $y_{i, 2}$ (panel b) but also substantially improves it for $y_{i, 1}$ (panel a). We observe a similar trend in non-linear settings (Appendix~\ref{apdx:er_mlp}), supporting the claim that CET methods can leverage the additional information in a dataset for a more common event to help overcome the lack of information for a rarer event.


% Although, we note that the  the improvement in rare event reaches plateau when common event prediction continues to improve albeit a decreasing rate.

% Nevertheless, it is notable that there appears to be an upper limit to the benefits derived from increasing the event rate. Specifically, the improvement in rare disease prediction plateaus when the common event rate is approximately tenfold that of the rare event. In contrast, common event prediction performance continues to improve beyond this threshold, albeit at a progressively decreasing rate.

\begin{figure}[!htb]
  \centering
  \includegraphics[width=1\linewidth]{images/event_rate_linear}
  \caption{Performance of single-label learning and CET-LR on (a) rare and (b) common diseases generated by the linear DGP. The rare disease ($y_1$) event rate is 1\%, and the common disease ($y_2$) event rate is varied from 1\% to 30\%. Shading represents 95\% confidence intervals.}\label{fig:event_rate_linear}
\end{figure}