\section{Additional Experimental Results}\label{apdx:add-results}

\subsection{Comparison with Alternative Approaches}\label{apdx:comparison}
We compare our proposed methods with several established approaches in the context of rare event prediction and multi-event information sharing. For single-label learning, we use Firth logistic regression and gradient boosting as two alternative baselines, both known for their high performance in rare event prediction with high-dimensional datasets similar to our setup \citep{doerken2019penalized}. We conduct feature selection using LASSO prior to running Firth logistic regression to avoid the convergence issues that arise on large datasets with linearly dependent variables. For multi-event learning, we include transfer learning as an additional baseline, where neural networks are pre-trained on common events and then fine-tuned on the target event.

The rare event prediction performance of all alternative approaches, as well as the methods we propose, is summarized in Table~\ref{tab:simulation} and Table~\ref{tab:real-world} for synthetic datasets using the non-linear DGP and real-world datasets, respectively. For CET methods, we only include the results obtained using the $L_2$ magnitude for similarity.

Notably, our proposed method CET-NN achieves the best average performance across all real-world datasets, and matches or outperforms all other methods in the synthetic dataset when event similarity exceeds 40\%.


\begin{table}[h]
    \centering
    \caption{Average (standard deviation) AUC or Spearman's correlation coefficient ($\rho$) on synthetic datasets generated by the non-linear DGP.}\label{tab:simulation}
    \begin{tabular}{c|l|ccccccc}
        \hline
         && \multicolumn{3}{c}{Multi-label learning} & \multicolumn{4}{c}{Single-label learning} \\
        \hline
         & \makecell{Similarity} & Multilabel NN & Transfer NN& CET NN & LR & Firth LR & GDBoost & NN \\
        \hline
         \multirow{6}{*}{AUC} & 100\%  &  0.880 (.070)  &  0.886 (.067) & 0.895 (.067) & \multirow{6}{*}{\makecell{0.884 \\ (.079)}} & \multirow{6}{*}{\makecell{0.884 \\ (.078)}} & \multirow{6}{*}{\makecell{0.869 \\ (.081)}} & \multirow{6}{*}{\makecell{0.871 \\ (.074)}} \\
         & 80\%  & 0.880 (.069) & 0.883 (.070) & 0.890 (.070) & & & & \\
         & 60\%  & 0.878 (.071) & 0.878 (.071) & 0.884 (.072) & & & & \\
         & 40\%  & 0.875 (.072) & 0.877 (.074) & 0.881 (.073) & & & & \\
         & 20\%  & 0.874 (.072) & 0.875 (.072) & 0.876 (.076) & & & & \\
         & 0\%   & 0.873 (.073) & 0.874 (.075) & 0.876 (.073) & & & & \\
         \hline
        \multirow{6}{*}{$\rho$} & 100\% & 0.842 (.045) & 0.861 (.050) & 0.900 (.040) & \multirow{6}{*}{\makecell{0.778 \\ (.049)}}& \multirow{6}{*}{\makecell{0.780 \\ (.049)}}& \multirow{6}{*}{\makecell{0.731 \\ (.049)}}& \multirow{6}{*}{\makecell{0.796  \\ (.051)}} \\
         & 80\%  & 0.834 (.049) & 0.849 (.048) & 0.873 (.042) & & & & \\
         & 60\%  & 0.820 (.047) & 0.828 (.046) & 0.840 (.042) & & & & \\
         & 40\%  & 0.811 (.051) & 0.826 (.047) & 0.824 (.051) & & & & \\
         & 20\%  & 0.801 (.051) & 0.806 (.045) & 0.796 (.053) & & & & \\
         & 0\%   & 0.799 (.055) & 0.799 (.045) & 0.797 (.049) & & & & \\
        \hline
    \end{tabular}
\end{table}

\begin{table}[h]
    \centering
    \caption{Average (standard deviation) AUC on real-world rare disease datasets.}\label{tab:real-world}
    \begin{tabular}{l|l|ccccccc}
    \hline
    % \multirow{2}{*}{\makecell{Target}} & \multirow{2}{*}{\makecell{Surrogate disease}} & \multicolumn{3}{c}{Multi-label learning} & \multicolumn{4}{c}{Single-label learning} \\
    & & \multicolumn{3}{c}{Multi-label learning} & \multicolumn{4}{c}{Single-label learning} \\ \hline
    \makecell{Target} & \makecell{Secondary disease} & Multilabel NN & Transfer NN & CET NN & LR & Firth LR & GDBoost & NN \\ \hline
    Stroke & Hypertensive crisis & 0.652 (.008) & 0.639 (.012) & 0.670 (.017) & \multirow{3}{*}{\makecell{0.605 \\ (.018)}} & \multirow{3}{*}{\makecell{0.633 \\ (.019)}} & \multirow{3}{*}{\makecell{0.643 \\ (.016)}} & \multirow{3}{*}{\makecell{0.627 \\ (.009)}} \\
    Stroke & Heart failure & 0.651 (.010) & 0.636 (.020) & 0.656 (.013) & & & & \\
    Stroke & Renal failure & 0.653 (.014) & 0.630 (.014) & 0.656 (.025) & & & & \\ \hline
    Autism & Any other ND & 0.770 (.019) & 0.731 (.026) & 0.775 (.020) & \multirow{3}{*}{\makecell{0.721 \\ (.022)}} & \multirow{3}{*}{\makecell{0.733 \\ (.029)}} & \multirow{3}{*}{\makecell{0.726 \\ (.020)}} & \multirow{3}{*}{\makecell{0.738 \\ (.023)}} \\
    Autism & Language delay & 0.768 (.021)& 0.743 (.024) & 0.772 (.025)& & & & \\
    Autism & Motor delay & 0.737 (.014)& 0.732 (.025) & 0.742 (.017)& & & & \\ \hline
    \end{tabular}
\end{table}

\subsection{Additional Results for Simulations Varying Similarity}\label{apdx:sim}

In the simulation experiments using the linear DGP, we test the performance of CET-LR for both rare and common outcomes with two evaluation metrics, Spearman's rank correlation $\rho$ and AUC. The AUC results in Figure~\ref{fig:similarity_LR}b show a trend consistent with that of $\rho$ in Figure~\ref{fig:similarity_LR}a, supporting our strategy of using AUC as a proxy for real-world dataset evaluation. Figure~\ref{fig:similarity_LR}c shows that the performance for $y_2$ is mostly consistent with the baseline, indicating that learning for the common outcome is not compromised by the CET approach.
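
For reference, the two metrics can be sketched as follows. We assume here, for illustration, that $\rho$ is computed between the predicted risk scores $\hat{p}_i$ and the true event probabilities $p_i$ of the $n$ test samples, which are available only for synthetic data; the symbols $\hat{p}_i$, $p_i$, and $d_i$ are notation introduced for this sketch. With $d_i$ denoting the difference between the rank of $\hat{p}_i$ and the rank of $p_i$, and assuming no ties,
\[
  \rho = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n\,(n^2 - 1)},
  \qquad
  \mathrm{AUC} = \Pr\left(\hat{p}_i > \hat{p}_j \mid y_i = 1,\; y_j = 0\right).
\]
Both quantities depend on the predictions only through their ranking. AUC requires only the observed labels, which is one reason it is a plausible stand-in for $\rho$ when the true probabilities are unavailable.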


\begin{figure}[htbp]
  \centering
  \includegraphics[width=0.59\linewidth]{images/appendix/similarity_LR.png}
  \caption{Boxplots representing pairwise enhancement of (a) Spearman's rank correlation ($\rho$) for the rare outcome $y_1$, (b) AUC for the rare outcome $y_1$, (c) Spearman's rank correlation for the common outcome $y_2$. The red line indicates the baseline of single-label learning for each setup across iterations.}\label{fig:similarity_LR}
\end{figure}
In addition to the experiments that match each model to its corresponding synthetic dataset, i.e., LR for the linear DGP and NN for the non-linear DGP, we also investigate the performance of the LR model on the non-linear synthetic dataset. Figure~\ref{fig:similarity_unmatch} shows that the mismatch between the model structure and the underlying generative function not only decreases performance but also invalidates the CET enhancement across all levels of event similarity.
\begin{figure}[htbp]
  \centering
  \includegraphics[width=0.65\linewidth]{images/appendix/sim_unmatch.png}
  \caption{Boxplots representing pairwise enhancement for the rare outcome $y_1$ of (a) CET-LR, (b) CET-NN on datasets generated by the non-linear DGP. The red line indicates the baseline of single-label learning for each iteration.}\label{fig:similarity_unmatch}
\end{figure}

\subsection{Additional Results for Similarity Penalty Strength}\label{apdx:sim_lam}
To support the claim that the performance enhancement of CET methods comes from the added similarity term, we plot the similarity penalty parameter ($s$) selected via validation against the underlying event similarity. Figure~\ref{fig:penalty_strength} shows a positive correlation between these two factors. It is worth mentioning that this behavior makes our method robust to imposing a similarity penalty on unrelated events when a validation set is used to tune $s$. The fact that this behavior is more apparent in \sinabbr than \mulabbr may be due to our earlier observation that MLL using NNs can leverage shared information across events even without a similarity penalty.
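
To make the role of $s$ concrete, consider a minimal sketch of a similarity-penalized objective for two events. We assume here, for illustration only, that the penalty is the squared $L_2$ distance between the two coefficient vectors; the symbols $\beta_1$, $\beta_2$, $\mathcal{L}_1$, and $\mathcal{L}_2$ are hypothetical notation, and the exact form used by CET is the one defined in the main text.
\[
  \min_{\beta_1,\,\beta_2} \; \mathcal{L}_1(\beta_1) + \mathcal{L}_2(\beta_2) + s\,\|\beta_1 - \beta_2\|_2^2 .
\]
Under this assumed form, $s = 0$ decouples the objective into two separate single-label problems, whereas a large $s$ forces $\beta_1 \approx \beta_2$. A validation procedure that selects a small $s$ for dissimilar events therefore falls back toward the single-label fit, which is consistent with the trend in Figure~\ref{fig:penalty_strength}.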

\begin{figure}[htbp]
  \centering
  \includegraphics[width=0.7\linewidth]{images/penalty_strength}
  \caption{Log score of similarity penalty parameter $s$ learned by validation. Shaded areas represent 95\% confidence intervals.}\label{fig:penalty_strength}
\end{figure}

\subsection{Additional Results for Simulation Varying Event Rate}\label{apdx:er_mlp}

Increasing the common event rate has a similar impact on prediction performance for CET-NN in a non-linear setting (Figure~\ref{apdx:event_rate_mlp}) as it does for CET-LR in a linear setting (Figure~\ref{fig:event_rate_linear}). Both patterns support the claim that more common events can help rarer event prediction by leveraging additional information via CET methods.

\begin{figure}[htbp]
  \centering
  \includegraphics[width=0.7\linewidth]{images/appendix/event_rate_mlp.png}
  \caption{Performance of single-label learning and CET-NN on (a) rare and (b) common diseases generated by the non-linear DGP. The rare disease ($y_1$) event rate is 1\%, and the common disease ($y_2$) event rate is varied from 1\% to 30\%. Shading represents 95\% confidence intervals.}\label{apdx:event_rate_mlp}
\end{figure}

\subsection{Additional Results for LR Models on Real-world Datasets}\label{apdx:exp_lr}

In this section, we show results using LR models on our real-world datasets. We replot the results using the NN models for the sake of comparison (these are the results shown in Section~\ref{sec:experiments} of the main text). Figure~\ref{fig:FAER_full} shows that both CET-LR and CET-NN significantly enhance stroke prediction when incorporating a more common outcome, implying a substantial similarity in patient features or latent features across various maternal morbidities. Notably, models with an NN structure achieve superior baseline performance compared to those using LR in both real-world experiments, stroke prediction (Figure~\ref{fig:FAER_full}) and autism prediction (Figure~\ref{fig:autism_full}), and show a more pronounced enhancement effect through CET. The performance gap between LR and NN is especially large in the autism dataset, which suggests that the embedding features derived from diagnoses and procedures in EHRs are nonlinearly associated with ND outcomes.

We additionally explore the setting of utilizing multiple common events for the CET approach in the preeclampsia study, which includes all pairs of coefficients in the CET penalty (Figure~\ref{fig:FAER_full}). The results show that performance without the CET penalty is significantly enhanced, but only minimal additional improvement from the CET penalty is observed.
\begin{figure}[htbp]
  \centering
  \includegraphics[width=0.6\linewidth]{images/appendix/FAER2.png}
  \caption{Boxplots showing pairwise improvement in the AUC of stroke prediction via (a) CET-LR, (b) CET-NN across 10 resampling runs. The red line indicates the single-label learning baseline.}\label{fig:FAER_full}
\end{figure}

\begin{figure}[!htb]
  \centering
  \includegraphics[width=0.6\linewidth]{images/appendix/autism_appx.png}
  \caption{Boxplots showing pairwise performance improvement in the AUC of (a) autism via CET-LR, (b) other neurodevelopmental diagnoses (ND) via CET-LR, (c) autism via CET-NN, (d) other ND via CET-NN across 10 resampling runs. The red line indicates the single-label learning baseline.}\label{fig:autism_full}
\end{figure}


\newpage
\subsection{Additional Results on Expanded Training Set for Preeclampsia Study}\label{apdx:stroke_expand}

We performed an alternative partitioning that enlarges the training set to 270,000 samples to test the efficacy of the method on datasets with a larger sample size. This results in improved baseline performance, yet a diminished enhancement from MLL, as shown in Figure~\ref{fig:stroke_expand}.


\subsection{Additional Results of Using a Single ND Event for Autism Study}\label{apdx:autism}

As discussed in Section~\ref{sec:conclusion}, including additional outcomes could benefit CET by increasing the common event rate. However, including more unrelated events could also reduce the similarity between events. In our autism study in Section~\ref{sec:experiments}, we defined our secondary outcome as a union over several events. This can be seen as a simplistic approach for utilizing more than two outcomes. To investigate the trade-off of including more events in our common event, we examined the performance of MLL and CET methods on the autism dataset when using a single ND outcome as the common outcome or combining multiple NDs into one common outcome (event rate 18.5\%). Specifically, we considered using the two most prevalent ND events, language delay (15.6\%) and motor delay (5.7\%), as the common outcome. The results in Table~\ref{tab:real-world} show that the combined common event yields better performance than either of these single ND events. Additionally, the CET method performs significantly better when using language delay (AUC: 0.772) than when using motor delay (AUC: 0.742), which further supports the benefit of using events with a higher event rate as surrogate events.


\begin{figure}[!htb]
  \centering
  \includegraphics[width=0.6\linewidth]{images/appendix/FAER_expand}
  \caption{Boxplots showing pairwise improvement in the AUC of stroke prediction via CET-NN on the expanded dataset across 10 resampling runs. The red line indicates the single-label learning baseline.}\label{fig:stroke_expand}
\end{figure}
