\section{Experimental Details}\label{apdx:exp-details}

\subsection{Data Generating Process (DGP) for Synthetic Dataset}\label{apdx:exp-details-dgp}
In the simulation experiments, we first generate a synthetic dataset $\mathbf{X} \in \mathbb{R}^{n\times k}$ comprising $n$ patients with $k$ features each. All features are sampled independently and identically from the normal distribution $\mathcal{N}(0,10)$. We then use a linear or non-linear DGP to generate observations of two binary outcomes, $y_{i,1}$ and $y_{i,2}$, with expected event rates $\pi_1$ and $\pi_2$ and expected event similarity $\rho$ between the two outcomes. Only a randomly selected subset of the feature matrix, $\mathbf{X}^{(r)}\in \mathbb{R}^{n\times r}$ with $r\leq k$ relevant features, is used to compute the outcome probabilities, while the remaining features serve as noise. We set $k=25$ and $r=20$ in all simulations.

For the linear DGP, we first generate a pair of standardized $r$-dimensional vectors $\bm{\theta}_1$ and $\bm{\theta}_2$ with cosine similarity $\rho$, and use them as feature coefficients to compute the initial logits for $y_{i,1}$ and $y_{i,2}$ as linear combinations of $\x_i^{(r)}$. The intercept terms $\gamma_1$ and $\gamma_2$ are then found by search, acting as offsets that align the event probabilities with the expected event rates $\pi_1$ and $\pi_2$. The resulting event probabilities can be written as
\begin{equation}
    \begin{gathered}
    P(y_{i, 1} = 1 | \x_i) = \sigma(\bm{\theta}_1\x_i^{(r)}+\gamma_1) \textrm{, and}
    \\
    P(y_{i, 2} = 1 | \x_i) = \sigma(\bm{\theta}_2\x_i^{(r)}+\gamma_2),
    \end{gathered}
\end{equation}
where $\sigma$ denotes the sigmoid function. Finally, the synthetic observations of $y_{i,1}$ and $y_{i,2}$ are generated through a random Bernoulli draw based on the event probability for each patient.
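
As an illustrative sketch of the two steps above, and not necessarily the exact procedure used, one construction that yields the required cosine similarity takes two orthonormal vectors $\mathbf{u}, \mathbf{v} \in \mathbb{R}^{r}$ (hypothetical symbols introduced only for this example) and sets
\begin{equation*}
    \begin{gathered}
    \bm{\theta}_1 = \mathbf{u}, \qquad \bm{\theta}_2 = \rho\,\mathbf{u} + \sqrt{1-\rho^2}\,\mathbf{v},
    \\
    \text{so that}\quad \|\bm{\theta}_1\| = \|\bm{\theta}_2\| = 1 \quad\text{and}\quad \frac{\langle \bm{\theta}_1, \bm{\theta}_2\rangle}{\|\bm{\theta}_1\|\,\|\bm{\theta}_2\|} = \rho .
    \end{gathered}
\end{equation*}
Similarly, assuming the offsets are calibrated against the empirical mean of the event probabilities (the precise criterion is an assumption here), each intercept $\gamma_j$, $j \in \{1,2\}$, solves
\begin{equation*}
    \frac{1}{n}\sum_{i=1}^{n} \sigma(\bm{\theta}_j\x_i^{(r)}+\gamma_j) = \pi_j .
\end{equation*}
The left-hand side is continuous and strictly increasing in $\gamma_j$, with limits $0$ and $1$, so a unique solution exists for any $\pi_j \in (0,1)$ and can be located by a one-dimensional search such as bisection.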

For the non-linear DGP, we use two mapping functions $h_1(\cdot)$ and $h_2(\cdot)$ that map $\x_i^{(r)}$ to $l$-dimensional latent feature vectors $\x_{i,1}^{(l)}$ and $\x_{i,2}^{(l)}$. Specifically, we generate two orthogonal mapping matrices, $\mathbf{W}_1$ and $\mathbf{W}_2$, each followed by a ReLU activation function. The mappings can thus be written as
\begin{equation}
    \begin{gathered}
    \x_{i,1}^{(l)} = h_1(\x_{i}^{(r)}) = \text{ReLU}(\mathbf{W}_1\x_i^{(r)}) \textrm{, and}
    \\
    \x_{i,2}^{(l)} = h_2(\x_{i}^{(r)}) = \text{ReLU}(\mathbf{W}_2\x_i^{(r)}),
    \end{gathered}
\end{equation}
where $\mathbf{W}_1 \in \mathbb{R}^{l\times r}$ and $\mathbf{W}_2 \in \mathbb{R}^{l\times r}$. We then generate outcome observations by the same process as in the linear DGP, but replace the feature vector $\x_i^{(r)}$ with the latent feature vectors $\x_{i,1}^{(l)}$ and $\x_{i,2}^{(l)}$, so that $\bm{\theta}_1$ and $\bm{\theta}_2$ are now $l$-dimensional. The event probabilities in the non-linear DGP become
\begin{equation}
    \begin{gathered}
    P(y_{i, 1} = 1 | \x_i) = \sigma(\bm{\theta}_1\x_{i,1}^{(l)}+\gamma_1) \textrm{, and}
    \\
    P(y_{i, 2} = 1 | \x_i) = \sigma(\bm{\theta}_2\x_{i,2}^{(l)}+\gamma_2).
    \end{gathered}
\end{equation}
In the non-linear DGP, $\x_{i,1}^{(l)}$ and $\x_{i,2}^{(l)}$ share $l \times \rho$ of their $l$ latent features. In other words, a proportion $\rho$ of the rows of the mapping matrices $\mathbf{W}_1$ and $\mathbf{W}_2$, together with the corresponding entries of the coefficient vectors $\bm{\theta}_1$ and $\bm{\theta}_2$, is designed to be identical across the two outcomes.
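
To make the sharing scheme concrete, write $s = l\rho$ for the number of shared latent features, assuming for illustration that $l\rho$ is an integer and that the shared features are ordered first. The block notation below, with subscripts $\mathrm{s}$ (shared) and $\mathrm{u}$ (unshared), is introduced only for this sketch. The mapping matrices and coefficient vectors can then be partitioned as
\begin{equation*}
    \mathbf{W}_j = \begin{bmatrix} \mathbf{W}_{\mathrm{s}} \\ \mathbf{W}_{j,\mathrm{u}} \end{bmatrix},
    \qquad
    \bm{\theta}_j = \begin{bmatrix} \bm{\theta}_{\mathrm{s}} & \bm{\theta}_{j,\mathrm{u}} \end{bmatrix},
    \qquad j \in \{1,2\},
\end{equation*}
with $\mathbf{W}_{\mathrm{s}} \in \mathbb{R}^{s\times r}$ and $\mathbf{W}_{j,\mathrm{u}} \in \mathbb{R}^{(l-s)\times r}$. Because the ReLU acts element-wise, the logit of outcome $j$ decomposes as
\begin{equation*}
    \bm{\theta}_j\x_{i,j}^{(l)}+\gamma_j
    = \bm{\theta}_{\mathrm{s}}\,\text{ReLU}(\mathbf{W}_{\mathrm{s}}\x_i^{(r)})
    + \bm{\theta}_{j,\mathrm{u}}\,\text{ReLU}(\mathbf{W}_{j,\mathrm{u}}\x_i^{(r)})
    + \gamma_j .
\end{equation*}
The first term is common to both outcomes, while the second is outcome-specific. Under this reading, $\rho=1$ gives two outcomes whose logits differ only in their intercepts, and $\rho=0$ gives outcomes with no shared latent features.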

The sample size of the synthetic datasets is $n=50{,}000$ for the linear DGP experiments and $n=250{,}000$ for the non-linear DGP experiments. The dimension of the latent space in the non-linear DGP is set to $l=5$ in our simulation experiments.

In experiments that vary the event similarity or event rate, we modify only the synthetic observations of the common outcome $y_2$, while keeping the feature matrix $\mathbf{X}$ and the observations of the rare outcome $y_1$ identical across experimental setups that share the same random seed.

\subsection{Additional Details for Model Training}\label{apdx:exp-details-training}

In both simulation and real-world experiments, we conduct 10 iterations under every experimental setup. In each iteration, we either randomly generate a synthetic dataset or randomly partition the real-world dataset into training and testing sets. The random seed is always set to the iteration number.

We allocate 25\% of the samples in the training set for validation. In the validation for MLL, simulation experiments use the aggregated performance across all outcomes as the criterion, whereas real-world experiments focus solely on the outcome of interest. The learning rate, batch size, and hidden layer size (for NN models only) are pre-tuned and held fixed across iterations. The strength parameters for the ridge regularization and the similarity penalty are selected in each iteration by grid search based on the validation AUC. Early stopping based on the validation cross-entropy loss is also implemented to avoid overfitting.
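
As a sketch of this selection rule, let $\lambda$ and $\mu$ denote the ridge and similarity-penalty strengths and let $\Lambda$ and $M$ denote their candidate grids. These symbols are introduced here only for illustration, and the grid values are not specified in this appendix. Assuming the two strengths are searched jointly, the pair selected in each iteration is
\begin{equation*}
    (\hat{\lambda}, \hat{\mu}) = \operatorname*{arg\,max}_{(\lambda,\mu)\,\in\,\Lambda\times M} \mathrm{AUC}_{\mathrm{val}}(f_{\lambda,\mu}),
\end{equation*}
where $f_{\lambda,\mu}$ is the model fitted on the training split with the given strengths and $\mathrm{AUC}_{\mathrm{val}}$ is its AUC on the held-out 25\% validation split.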