\section{Setup and Related Work}\label{sec:setup}
 
We consider a dataset $\mathcal{D}_n = \{(\x_i, \y_{i})\}_{i=1}^n$ of $n$ independent samples where, for each patient $i$, $\x_i$ is a $p$-dimensional feature vector and $\y_{i} \in \{0, 1\}^M$ is a vector of $M$ binary outcomes. We focus on the case where $M=2$, though our results can easily be generalized to settings where $M > 2$. Without loss of generality, we let $y_{i, 1}$ be the label for a rare event of interest such that the event rate, $\frac{1}{n}\sum_{i=1}^n y_{i, 1}$, is low (e.g., 0.01 or 0.001) and we let $y_{i, 2}$ be the label for another, more common, related event.

We assume that the probabilities of $y_{i, 1}$ and $y_{i, 2}$ can be written as
\begin{equation}\label{eq:log-model}
    \begin{gathered}
    P(y_{i, 1} = 1 | \x_i) = \sigma(\bm{\theta}_1' h(\x_i)) \textrm{, and}
    \\
    P(y_{i, 2} = 1 | \x_i) = \sigma(\bm{\theta}_2' h(\x_i)).
    \end{gathered}
\end{equation}
In Equation~\ref{eq:log-model}, $\x_i$ is assumed to include a constant feature, $\sigma(z) = 1/(1 + e^{-z})$ is the sigmoid function, $\bm{\theta}_1, \bm{\theta}_2 \in \mathbb{R}^d$, and $h:\mathbb{R}^{p} \to \mathbb{R}^d$ is a function that maps $\x_i$ to a $d$-dimensional vector of latent features.

Note that when $h(\x_i) = \x_i$ (so that $d = p$) this simplifies to a standard logistic model. In this way, we view the probabilities of $y_{i, 1}$ and $y_{i, 2}$ as being determined by two separate logistic models on either the input features themselves or some latent features that we can learn. The vectors $\bm{\theta}_1$ and $\bm{\theta}_2$ then act as the $d$-dimensional parameter vectors for the corresponding outcomes.

% \mme{Call it $p$-dimensional and note that $\bm{x}$ is assumed to include a constant feature. Also consider using sigmoid $\sigma(\cdot)$ notation rather than sigma.}

% where $x_{i,j}$ is the value of feature $j$ for unit $i$, $\beta^r_1,\cdots, \beta^r_p$ are the $p$ slope parameters and $\alpha^r$ is the intercept parameter of the logistic model for the rare event. Similarly $\beta^c_1,\cdots, \beta^c_p$ are the $p$ slope parameters and $\alpha^c$ is the intercept parameter of the logistic model for the more common event. Note that we can rewrite each model in vector notation as
% \[P(y_{i, 1} = 1 | \x_i) = \text{sigma}((\bm{\theta}^r)^T \mathbf{v}_i).\]


% If we can accurately estimate the parameter vector, $\bm{\theta}^r$, then we can determine the probability of an adverse event for any unit $i$ (i.e. $P(y_{i, 1} = 1 | \x_i)$). 

% In logistic regression (LR), the parameters of a logistic model are most commonly obtained via maximum likelihood estimation (MLE). We can write the log-likelihood for the rare event (and similarly for the common event) as
% \begin{equation*}
%     \begin{gathered}
%         \mathcal{L}(\bm{\theta}^r | \mathcal{D}_n) = \\
%         \sum_{(\x_i, y_{i, 1}, y_{i, 2})\in\mathcal{D}_n} y_{i, 1}\log(f^r(\x_i)) + (1 - y_{i, 1})\log(1 - f^r(\x_i))
%     \end{gathered}
% \end{equation*}
% where $f^r(\x_i) = \text{sigma}((\bm{\theta}^r)^T \mathbf{v}_i))$.

% \mme{I'd recommend paring down the description of LR and MLE for LR.}

\subsection{Logistic Regression \& L2 Regularization}
Logistic regression (LR) is a commonly employed technique for classification. However, LR is known to be biased and unstable in small samples \citep{firth1993bias}. In the context of rare events, \cite{wang2020logistic} showed that the convergence rate of the LR maximum likelihood estimator (MLE) is much slower than usual: specifically, $O_p(n_1^{-\frac{1}{2}})$, where $n_1 = \sum_{i=1}^n y_{i, 1}$ is the number of events.\footnote{The standard convergence rate is $O_p(n^{-\frac{1}{2}})$ \citep{wang2020logistic}.} In turn, the amount of information available for rare event problems is directly related to the number of events in the dataset rather than to $n$. For example, with $n = 100{,}000$ and an event rate of $0.001$, we have $n_1 = 100$, so the estimation error scales like $100^{-\frac{1}{2}} = 0.1$ rather than $100{,}000^{-\frac{1}{2}} \approx 0.003$. This essentially decreases the effective size of a dataset, making LR for rare events particularly unstable and susceptible to problems with small sample sizes. To offset this unstable behavior, ridge regression was proposed by \cite{hoerl1970ridge} and later extended to LR by \cite{cessie1992ridge}. This ubiquitous technique places an $L_2$ penalty on model coefficients to decrease the variance of the estimate at the expense of a higher bias.
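
For concreteness, one common way to write the ridge-penalized estimate for the rare event, under the model in Equation~\ref{eq:log-model} with $h(\x_i) = \x_i$, is sketched below. The symbol $\lambda \geq 0$ (the penalty strength) is introduced here purely for illustration, and this sketch penalizes every coordinate of $\bm{\theta}_1$, including the intercept, only for simplicity; implementations often leave the intercept unpenalized.
\begin{equation*}
    \hat{\bm{\theta}}_1^{\lambda} = \underset{\bm{\theta}_1 \in \mathbb{R}^p}{\arg\max} \; \sum_{i=1}^n \Big[ y_{i, 1} \log \sigma(\bm{\theta}_1' \x_i) + (1 - y_{i, 1}) \log\big(1 - \sigma(\bm{\theta}_1' \x_i)\big) \Big] - \lambda \|\bm{\theta}_1\|_2^2 .
\end{equation*}
Setting $\lambda = 0$ recovers the unpenalized MLE, while larger values of $\lambda$ shrink $\bm{\theta}_1$ towards zero.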

% by adding the penalty to the log-likelihood. The regularized log-likelihood then becomes
% \begin{equation}\label{eq:ridge}
%     \mathcal{L}^{\lambda_{ridge}}(\bm{\theta}^r | \mathcal{D}_n) = \mathcal{L}(\bm{\theta}^r | \mathcal{D}_n) - \lambda_{ridge}(\sum_{j=1}^p \beta_j^2)
% \end{equation}

% where $\lambda_{ridge}$ is a parameter signifying the strength of the regularization.\footnote{Note that in Equation~\ref{eq:ridge} the intercept term is omitted from the penalty as to not effect the baseline estimate. However, theoretical analysis of ridge LR typically either omits the intercept term from the model or includes it in the penalty (add citations). In practice, a biased intercept term does not effect ranking metrics such as AUC and can often be corrected for in post-processing steps to better calibrate the model (add another citation) \citep{puhr2017firth}.}

In the setting of rare events, where little information is present in the data, such regularization can lead to better out-of-sample performance \citep{pavlou2016review}. 
% Numerous variations of penalized logistic regression have been proposed with the similar goal of decreasing variance at the expense of increased bias \citep{puhr2017firth}. 
However, as noted by \cite{vsinkovec2021tune}, tuning of the penalty parameter can be unstable and ultimately penalization alone cannot overcome insufficient sample sizes or extreme class imbalance \citep{riley2020calculating, blagus2010class}.

% \mme{Although all of this is a nice write-up, we need to pare it down to focus on the key elements that are less well-understood yet relevant to our setup. Specifically, I think the Wang (2020) citation (slow convergence for rare events) and known benefits + limitations of L2 regularization for rare events (last paragraph) are the key pieces to retain.}


\subsection{Multi-label and Multi-task Learning}

One approach to combat the small effective sample size of rare events is to leverage information from related events. Multi-label learning (MLL) methods are designed to predict a set of outcomes from a collection of input features, often sharing information between labels \citep{aly2005survey, zhang2013review, liu2021emerging}. Related to MLL methods, multi-task learning (MTL) methods can leverage information from different tasks trained on different datasets to improve performance across all tasks \citep{zhang2021survey}. In general, MLL can be seen as a form of MTL where the same dataset is used to learn about each task.

We combine a specific form of regularization-based information sharing with representation-learning-based information sharing to improve the performance of rare event modeling. The idea of regularized MLL (or MTL) has been widely studied \citep{cao2019rmtl}. \cite{evgeniou2004regularized} and \cite{evgeniou2005learning} were among the first to explore the topic in the context of kernel estimators. \cite{zhou2012modeling} used a version of the fused lasso with an $L_1$ penalty between the coefficient vectors, and more recent works like \cite{he2019efficient}, \cite{yu2020learning}, and \cite{alesiani2021towards} have imposed a variety of related penalties on coefficient vectors for scalable and interpretable MTL.
% \cite{zhou2011clustered} took a clustering approach to the problem while \cite{liu2012multi} focused on regression from a probabilistic framework. 
More recently, \cite{pmlr-v89-janati19a} used the Wasserstein distance for sparse regression, \cite{tang2020regularized} utilized evolutionary algorithms, and \cite{bai2022saliency} looked at regularization in multi-task deep learning problems. 

The growth of deep learning methods has led to representation learning being used extensively for MLL \citep{huang2019supervised, liu2021emerging}. Recent work has combined this approach with regularization-based information sharing techniques by using methods such as manifold regularization \citep{zhu2021representation}, full-order label correlation \citep{chen2019multi}, shrinkage methods \citep{han2010multi}, and dimensionality reduction \citep{huang2020multi}.

% Regularized MTL and MLL have proved successful in computer vision \citep{luo2012manifold, liu2015single}, medical applications \citep{hossain2021pan, zhu2022physiomtl}, and behavioral and cognitive science \citep{xiao2019manifold}. 

% \mme{general medical benefits are less relevant here, but the rare event citations are key; I'd reduce the former but consider providing more details on the latter (e.g. Pouyanfar, Pillai).} 

Regularized MLL/MTL has become increasingly popular for medical applications \citep{hossain2021pan, zhu2022physiomtl}, but limited work has focused on rare event prediction. 
% \citep{pillai2023rare} used MTL to help predict rare life events, but did so in the context of a time series problem and was focused on anomaly detection versus prediction of a specific event. 
\cite{zhang2015predicting} proposed a regularized MLL approach for the prediction of drug side effects, using regularization to inform feature selection for an ensemble model. 
% \cite{li2015patient} used MLL to predict hypertensive complications. However, they used an SVM model with SMOTE to overcome the class imbalance. 
\cite{faletto2023predicting} considered a regularized ordinal regression method inspired by the fused lasso and designed to shrink towards proportional odds in settings where the rare event can be characterized as the most extreme of a set of ordered outcomes.
Most similar to our approach is \cite{lapedriza2007hierarchical} which imposes a penalty between the coefficient vectors of logistic regression models for related tasks. 
% Our approach builds on this by incorporating feature learning and several measures of vector similarity, then providing rigorous supporting theoretical and empirical analyses.
Our approach builds on this in two important ways. First, it is the first to explore, via both simulation and theory, the impact of event relatedness and event rate when using shrinkage penalties like those used by \cite{lapedriza2007hierarchical} and by our method. Second, we incorporate feature learning via a neural network architecture to allow our approach to extend to more complicated non-linear setups.
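
To make the idea of a penalty between coefficient vectors concrete, a generic sketch under the model in Equation~\ref{eq:log-model} is given below. Here $\ell_m(\bm{\theta}_m)$ denotes the log-likelihood for outcome $m \in \{1, 2\}$ and $\lambda \geq 0$ is a penalty strength; both symbols, and the choice of squared Euclidean distance, are illustrative assumptions and not necessarily the exact formulation used in the works cited above.
\begin{equation*}
    \max_{\bm{\theta}_1, \bm{\theta}_2 \in \mathbb{R}^d} \; \ell_1(\bm{\theta}_1) + \ell_2(\bm{\theta}_2) - \lambda \|\bm{\theta}_1 - \bm{\theta}_2\|_2^2 .
\end{equation*}
With $\lambda = 0$ the two models are fit separately, whereas as $\lambda \to \infty$ the two coefficient vectors are forced to coincide, so $\lambda$ controls how much information the rare event borrows from the common one.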

\subsection{Other Approaches}
There are various other approaches that are less related to our proposed method but can also be used for rare event prediction tasks. These approaches often employ similar regularization or information-sharing strategies but also include pre-processing steps and different machine learning model architectures.

Transfer learning is a well-known information sharing approach that aims to adapt a model trained on one task -- typically with plentiful data -- to a second task for which less data are available \citep{zhuang2020comprehensive}. Unlike MLL and MTL, transfer learning involves training on these tasks in sequence rather than simultaneously. We note that multi-label, multi-task, and transfer learning are not mutually exclusive. For example, transfer learning can be used to train a multi-class classifier on new tasks.

There are also a number of single-label learning approaches that have been employed for rare event prediction. Firth logistic regression is a widely used method for prediction of imbalanced binary outcomes \citep{puhr2017firth}. It introduces a penalization term which helps to reduce bias in parameter estimation when dealing with rare events \citep{olmucs2022comparison}. Ensemble-based machine learning algorithms, including gradient boosting and random forests, are also commonly used for rare event prediction due to their ability to model complex and non-linear relationships while also mitigating overfitting by drawing random subsamples during training \citep{shyalika2023comprehensive}.

Another traditional method to address data imbalance is to use resampling algorithms, such as under-sampling (down-sampling) and over-sampling (up-sampling) \citep{barandela2004imbalanced}. Under-sampling removes examples from the majority class, while over-sampling replicates minority class samples to balance the dataset. However, the former may lead to loss of important information, so it is usually preferable to combine over-sampling with other techniques. For example, the Synthetic Minority Over-sampling Technique (SMOTE) generates synthetic minority class samples using a nearest neighbors approach \citep{elreedy2019comprehensive}. In practice, these data preprocessing techniques can be used together with our proposed approach and comparator methods. However, exploring this is outside the scope of this paper and we leave it to future work.
