\section{Real-world Experiments}\label{sec:experiments}

The simulation results in Section~\ref{sec:simulations} demonstrate the effectiveness of CET in settings with sufficient event similarity and common event rate. In this section, we apply CET to two real-world datasets and analyze the extent to which its ability to leverage information from a more common outcome improves performance on a rare outcome.

Our two datasets consist of electronic health record (EHR) data. The first comes from a preeclampsia study on women with hypertensive disorders of pregnancy \citep{meng2023maternal}, and the second comes from an early autism study on children under the age of 18 months. For each dataset, we aim to train a prognostic model for a rare outcome while leveraging similar alternative patient outcomes.

As in Section~\ref{sec:simulations}, we compare CET-LR and CET-NN to a single-label learning baseline and to MLL without a similarity penalty. For both datasets, we use a validation set for early stopping and hyperparameter tuning. We use AUC as our performance metric and report results on the rare outcome of interest. This section presents results for the NN models because of their superior performance. See Appendix~\ref{apdx:exp-details-training} for further implementation details and Appendix~\ref{apdx:exp_lr} for results using LR models.
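
For reference, AUC has a standard probabilistic interpretation, stated below. The notation is introduced here for illustration only and is not taken from the earlier sections: $s(\cdot)$ denotes a model's predicted risk score, $x^{+}$ a randomly drawn patient who experienced the rare outcome, and $x^{-}$ a randomly drawn patient who did not.
\[
  \mathrm{AUC} \;=\; \Pr\bigl(s(x^{+}) > s(x^{-})\bigr) \;+\; \frac{1}{2}\,\Pr\bigl(s(x^{+}) = s(x^{-})\bigr)
\]
Because this quantity compares a random positive with a random negative, its value does not depend directly on the event rate, although its estimate becomes noisier when few positive cases are available.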

\subsection{Maternal Morbidity in Preeclampsia}

The maternal morbidity dataset is composed of 553,658 patients with hypertensive disorders of pregnancy. The input features include patient demographics, ICD code-based diagnoses and medical procedures, and hospital-level characteristics. The dataset contains four binary outcomes denoting whether a rare morbidity event occurred within one year post-delivery. These outcomes are stroke (event rate 0.075\%), hypertensive crisis (0.193\%), heart failure (0.248\%), and acute renal failure (0.171\%). Stroke has the lowest event rate and high clinical importance; we therefore select it as our primary outcome of interest. We separately assess the benefit of using each of the remaining outcomes as our common event, and discuss the possibility of adopting an architecture that uses all four outcomes together in Section~\ref{sec:conclusion}.

This dataset is unusually large compared to the single-institution clinical datasets that are more commonly used to train risk prediction models, a practice driven by the complexities of sharing medical data across sites. To align more closely with such datasets, we randomly sampled a subset of 80,000 patients for model training and allocated the remaining patients to evaluation. Doing so also mitigates the unstable performance metrics that small test sets often produce for rare events.
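
To give a rough sense of scale for stroke, the expected case counts implied by the reported 0.075\% event rate can be worked out as follows. This is an illustrative back-of-the-envelope calculation rather than a reported result: it assumes that the rate is computed over the full cohort and that random sampling preserves it in expectation, so the realized counts in any particular split will differ.
\[
\begin{array}{lrcl}
  \mbox{Full cohort:}      & 553{,}658 \times 0.00075 & \approx & 415 \\
  \mbox{Training subset:}  & 80{,}000 \times 0.00075  & =       & 60  \\
  \mbox{Evaluation set:}   & 473{,}658 \times 0.00075 & \approx & 355
\end{array}
\]
Under these assumptions, a model is trained on roughly 60 stroke cases, while roughly 355 remain for evaluation.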

Figure~\ref{fig:FAER} shows that CET-NN consistently outperforms the baseline when each of the three other morbidity outcomes is used as the common event, with the largest improvement observed when using hypertensive crisis.
%Figure~\ref{fig:FAER} shows that CET-NN consistently outperform the baseline regardless of which of the three other morbidity outcomes they use as the common event. This improvement is most significant when hypertensive crisis is used as the common event, with this version of \mulabbr performing the best of all the models.
Given the similar event rates among the three common outcomes we considered, the increased benefit of using hypertensive crisis suggests a high degree of similarity between its risk factors and the risk factors of stroke. This aligns with clinical understanding of the strong link between hypertensive crisis and stroke \citep{pistoia2016hypertension}.

\begin{figure}[!htb]
  \centering
  \includegraphics[width=0.85\linewidth]{images/FAER_nn}
  \caption{Boxplots showing pairwise improvement in the AUC of stroke prediction via CET-NN across 10 iterations. The red line indicates the single-label learning baseline.}\label{fig:FAER}
\end{figure}

\subsection{Early Autism Prediction}

The early autism dataset contains medical information on 18,156 children. Features for a given patient were derived by first extracting each diagnosis and procedure code documented in that patient's chart from birth to 18 months, then mapping each code to its corresponding 256-dimensional word2vec embedding. The diagnosis embeddings and the procedure embeddings were each mean-pooled, and the two resulting vectors were concatenated into a single 512-dimensional feature vector. The outcomes comprise multiple neurodevelopmental diagnoses (ND), including ADHD, developmental delay, language delay, motor delay, and autism. We select autism as the target outcome (event rate 2.2\%) and define the presence of any other ND as the common event (18.5\%). We also explore the effect of splitting the common event into individual NDs in Appendix~\ref{apdx:autism}. We divided the data into training and testing sets with a ratio of 4:1 and applied the same model training and validation strategies as in the previous experiments.
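
The feature construction described above can be summarized in a single expression. The symbols are introduced here for illustration and are assumptions rather than the paper's own notation: $D_i$ and $P_i$ denote the collections of diagnosis and procedure codes recorded for patient $i$ from birth to 18 months, and $e(c)$ denotes the 256-dimensional word2vec embedding of code $c$.
\[
  x_i \;=\; \left[\; \frac{1}{|D_i|} \sum_{c \in D_i} e(c) \;\; ; \;\; \frac{1}{|P_i|} \sum_{c \in P_i} e(c) \;\right]
\]
Each averaged block has 256 dimensions, so the concatenated vector $x_i$ has 512. This sketch leaves open whether repeated codes are counted once or with multiplicity, and how patients with no codes of a given type are handled, since neither is specified here.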

The results show that incorporating more common ND outcomes into MLL models via \mulabbr (Figure~\ref{fig:Autism}) or \sinabbr (Appendix~\ref{apdx:exp_lr}) significantly enhances autism prediction performance.
%without compromising the prediction of these common outcomes
This indicates substantial similarity in the features, or latent features, associated with the various ND outcomes.
% Notably, models with an NN structure achieve superior baseline performance compared to those using LR, as well as showing a more pronounced enhancement effect through CET. This suggests that the embedding features derived from diagnosis and procedures in EHRs are nonlinearly associated with ND outcomes.

\begin{figure}[!htb]
  \centering
  \includegraphics[width=0.7\linewidth]{images/Autism_nn}
  \caption{Boxplots showing pairwise improvement in the AUC of autism prediction via CET-NN across 10 iterations. The red line indicates the single-label learning baseline.}\label{fig:Autism}
\end{figure}
