\section{Conclusion}\label{sec:conclusion}
We propose the use of common event tethering as a variation of regularized MLL optimized for rare event modeling. Our proposed methods, CET-LR and CET-NN, build on existing literature by coupling the learning of shared features via a neural network with a regularization approach that shrinks rare event coefficients toward those of a more common event using several alternative measures of vector similarity. We provide rigorous theoretical and empirical analyses showing the conditions under which CET methods are beneficial and exploring how leveraging more common events can lead to faster convergence rates. We support our findings with results on two real-world medical applications: first predicting rare cardiovascular morbidities in pregnant people with HDP, and then predicting autism likelihood in early childhood. We provide proofs of our theoretical results as well as additional experimental results in the Appendix.

We conclude this paper with a brief commentary on important considerations when implementing our CET approach. We provide insight into \textit{finding surrogate events}, comment on important \textit{ethical considerations}, address the \textit{case when $M > 2$}, and outline \textit{limitations and future work}.

\paragraph{Finding Surrogate Events} In our real-world experiments, we saw the greatest improvement in predicting a rare event (stroke) when we used a more common event (hypertensive crisis) that is known to be physiologically related and shares clinical risk factors \citep{pistoia2016hypertension}. In contrast, our other real-world examples used rare and common event combinations without a similarly strong known physiological link. This finding suggests that previous literature and domain knowledge should inform the selection of suitable surrogate events.

\paragraph{Ethical Considerations and Bias} In this paper, we show the potential performance gains of tethering rare events to related, more common events. However, we note that it is important for researchers to consider biases in a given clinical context and setting that may be relevant to the use of CET. The naive use of CET or other MLL approaches could worsen existing biases by propagating bias from one (biased) outcome to another (less biased) outcome. 

For example, both autism and ADHD are more common in boys than girls, but the imbalance is greater for autism, and girls tend to be diagnosed less often and at a later age \citep{loomes2017male}. Therefore, tethering ADHD to autism could lead to a more biased model compared to training using ADHD outcomes alone. 
% Similarly, hypertensive-related adverse events are more common in non-Hispanic Black persons \citep{abrahamowicz2023racial}, but the degree of imbalance differs between specific outcomes. Therefore, if race or highly correlated covariates are used as predictors, a CET approach may result in miscalibrated predictions in the Black subpopulation. The ethical implications of this would depend on the details and the application, but in general, there is potential for harm.

\paragraph{More Than Two Outcomes} We explored using all four outcomes to help predict stroke in the maternal morbidity dataset by including a penalty for all pairs of coefficients. In this setup, we saw negligible improvement, as the AUC went from 0.670 without the CET penalty to 0.674 with the CET penalty (full results included in Appendix~\ref{apdx:exp_lr}). We hypothesize that including additional outcomes is more forgiving for feature learning but less beneficial for similarity-based penalty approaches when not all of the outcomes are closely related, as is the case in the maternal morbidity dataset (see the pairwise comparisons in Figure~\ref{fig:FAER}).

The baseline approach for $M>2$ incorporates a penalty for each pair of outcome coefficient vectors. However, with domain knowledge, the framework could be modified to incorporate penalty terms only between the rare event and a select number of the most similar common events. We leave a more detailed exploration of this topic to future work. At this time, we advocate for targeted selection of a limited number of outcomes with related clinical etiology, especially given the previously discussed risks of common event tethering.
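
As an illustrative sketch only, with notation assumed here rather than taken from the preceding sections, let $\beta_m$ denote the coefficient vector for outcome $m$, let $d(\cdot,\cdot)$ denote any one of the vector dissimilarity measures, let $r$ index the rare event, and let $\mathcal{S} \subseteq \{1,\dots,M\}\setminus\{r\}$ be a hypothetical set of common events selected using domain knowledge. The all-pairs penalty and the targeted alternative could then be written as
\[
P_{\mathrm{all}} = \sum_{1 \le m < m' \le M} d(\beta_m, \beta_{m'}),
\qquad
P_{\mathrm{targeted}} = \sum_{m \in \mathcal{S}} d(\beta_r, \beta_m).
\]
Under this sketch, the targeted form reduces the number of penalty terms from $M(M-1)/2$ to $|\mathcal{S}|$, so that outcomes unrelated to the rare event no longer pull on its coefficients.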


\paragraph{Limitations \& Future Work}
The primary limitation of CET-LR and CET-NN is that using them effectively requires domain knowledge (\textit{i.e.}, clinical expertise) to select common events that share risk factors with the rare events of interest. Our results show that under typical conditions, our approach does not worsen performance even when the common and rare events are unrelated. Nevertheless, we believe standard MLL approaches may be more appropriate when such domain knowledge is not available.

Additionally, the effectiveness of CET-LR and CET-NN depends on a reasonable choice of the similarity penalty parameter $s$. While validation or domain knowledge can guide the choice of $s$, poorly chosen values of $s$ may lead to suboptimal results.

% General statements regarding the asymptotic behavior of CET-LR, and especially CET-NN, are difficult to make due to the interaction between the separate outcome model's coefficients during the MLE process. \mme{I don't follow this sentence} Future work is necessary to provide more extensive results of these properties either through theoretical guarantees or empirical simulation.

Finally, whereas the current work empirically explores the interaction between feature learning and our proposed regularization term, in future work we will more rigorously explore whether the benefits of CET-NN depend on the number of latent features (\textit{i.e.}, hidden layer width). We hypothesize that CET-NN provides greater benefit in the neural tangent kernel regime (\textit{i.e.}, wide hidden layer) \citep{jacot2018neural} and less benefit in the feature learning regime (\textit{i.e.}, narrow hidden layer).

%can enhance representation learning or if an overparametrized random feature learning approach is more beneficial. 

