\section{Common Event Tethering}\label{sec:method}

We consider two models, \textit{\sinname (CET-LR)} and \textit{\mulname (CET-NN)}, that share information across rare events to improve performance. Information sharing is encouraged by a regularization term that penalizes the difference between the event-specific weights: the coefficient vectors in a logistic regression model, or the final-layer weights in a neural network.

\subsection{CET Logistic Regression}\label{sec:single-step} 
% Logistic regression is a properly specified method to use when the relevant features in Equation~\ref{eq:log-model} are the measured patient characteristics. 
In Equation~\ref{eq:log-model}, if $h(\x_i) = \mathbf{H}\x_i$, for some matrix $\mathbf{H}\in\mathbbm{R}^{d\times p}$, logistic regression is correctly specified to learn the underlying probability models. For such cases, we introduce a \sinname (CET-LR) model. 

\sinabbr simultaneously maximizes the joint log-likelihood of $\bm{\theta}_1$ and $\bm{\theta}_2$ while penalizing dissimilarity between the two parameter vectors. Let $\bm{\theta} = [\bm{\theta}_1, \bm{\theta}_2]\in\mathbbm{R}^{2p}$. The penalized log-likelihood of \sinabbr is

\begin{equation}\label{eq:sin-ll}
    \begin{gathered}
        \mathcal{L}^{(s)}(\bm{\theta} | \mathcal{D}_n) = \mathcal{L}(\bm{\theta} | \mathcal{D}_n) - \frac{1}{2}s\| \bm{\theta}_1 - \bm{\theta}_2 \|_2^2.
    \end{gathered}
\end{equation}
Here, $\mathcal{L}(\bm{\theta} | \mathcal{D}_n)$ is the unregularized log-likelihood of $\bm{\theta}$ and $s \geq 0$ is a constant that controls the strength of the similarity regularization term. We note that CET-LR is equivalent to the method of \cite{lapedriza2007hierarchical} when $M=2$.
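
As an illustration, the penalized log-likelihood in Equation~\ref{eq:sin-ll} can be sketched in a few lines of NumPy. This is a minimal sketch rather than a reference implementation: we assume that the unregularized term is the sum of two Bernoulli log-likelihoods with $\hat{f}_k(\x_i) = \sigma(\x_i'\bm{\theta}_k)$, and the function names and the small constant \texttt{eps} are hypothetical choices made for this example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cet_lr_log_likelihood(theta1, theta2, X, Y, s):
    """Sketch of L(theta | D_n) - (s/2) * ||theta1 - theta2||_2^2.

    X: (n, p) feature matrix; Y: (n, 2) binary outcomes, one column per event.
    Assumption: the unregularized term is a sum of two Bernoulli log-likelihoods.
    """
    eps = 1e-12  # hypothetical guard against log(0)
    ll = 0.0
    for k, theta in enumerate((theta1, theta2)):
        f = sigmoid(X @ theta)  # predicted probability of event k
        ll += np.sum(Y[:, k] * np.log(f + eps)
                     + (1 - Y[:, k]) * np.log(1 - f + eps))
    penalty = 0.5 * s * np.sum((theta1 - theta2) ** 2)
    return ll - penalty
```

For fixed parameters, increasing $s$ lowers the objective by exactly $\frac{1}{2}s\|\bm{\theta}_1 - \bm{\theta}_2\|_2^2$, and the penalty vanishes when the two vectors coincide.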



\subsection{CET Neural Network}\label{sec:informed-nn}
We now consider the setting where $h(\x_i)$ maps to a set of latent features that are a non-linear combination of the input features in $\x_i$. In this setting, a logistic regression model on the input features is misspecified. To address this, we introduce \mulname (CET-NN) as an extension of CET-LR. \mulabbr fits an encoder, $\hat{h}_{\bm{\phi}}$, that maps the input features to a set of latent features. It then uses these latent features as input to a \sinabbr model.

The encoder $\hat{h}_{\bm{\phi}}$ and the $\bm{\theta}$ parameters of \mulabbr are learned simultaneously in a standard neural network architecture. In this setup, $\bm{\theta}$ simply becomes the final layer weight matrix mapping to the outcome vector $\y_i$. In particular, we can write the unregularized log-likelihood as

\begin{equation}\label{eq:multi-unreg-ll}
    \begin{gathered}
        \mathcal{L}(\bm{\theta}, \bm{\phi} | \mathcal{D}_n) = \\ \sum_{i=1}^n \left[\y_i'\log\left(\begin{bmatrix}
        \hat{f}_1(\x_i) \\
        \hat{f}_2(\x_i)
         \end{bmatrix}
         \right) + \right.\\
         \left.\left(\begin{bmatrix}
             1 \\
             1 \end{bmatrix} - \y_i\right)'\log\left(\begin{bmatrix}
        1 - \hat{f}_1(\x_i) \\
        1 - \hat{f}_2(\x_i)
         \end{bmatrix}
         \right)\right],
    \end{gathered}
\end{equation}
where 
\[\hat{f}_1(\x_i) = \sigma(\bm{\theta}_1 \hat{h}_{\bm{\phi}}(\x_i)), \quad \hat{f}_2(\x_i) = \sigma(\bm{\theta}_2 \hat{h}_{\bm{\phi}}(\x_i)).\]
Then, the log-likelihood of CET-NN with similarity penalty is

\begin{equation}\label{eq:multi-ll}
    \begin{gathered}
        \mathcal{L}^{(s)}(\bm{\theta}, \bm{\phi} | \mathcal{D}_n) = \mathcal{L}(\bm{\theta}, \bm{\phi} | \mathcal{D}_n) - \frac{1}{2}s\| \bm{\theta}_1 - \bm{\theta}_2 \|_2^2,
    \end{gathered}
\end{equation}
where $s\geq 0$ is again a constant used to control the strength of the regularization term.
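
To make the role of the final-layer weights concrete, the following NumPy sketch evaluates Equation~\ref{eq:multi-ll} for a toy network. The one-hidden-layer $\tanh$ encoder, its parameters \texttt{W} and \texttt{b} (standing in for $\bm{\phi}$), and all function names are assumptions made for illustration; the method itself does not prescribe a particular encoder architecture.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def encoder(X, W, b):
    """Hypothetical encoder h_phi with phi = (W, b): one tanh hidden layer."""
    return np.tanh(X @ W + b)

def cet_nn_log_likelihood(Theta, W, b, X, Y, s):
    """Sketch of L(theta, phi | D_n) - (s/2) * ||theta_1 - theta_2||_2^2.

    Theta: (2, d) final-layer weights; rows are theta_1 and theta_2.
    X: (n, p) inputs; Y: (n, 2) binary outcomes, one column per event.
    """
    eps = 1e-12                    # hypothetical guard against log(0)
    H = encoder(X, W, b)           # (n, d) latent features shared by both events
    F = sigmoid(H @ Theta.T)       # (n, 2); columns are f_1(x_i) and f_2(x_i)
    ll = np.sum(Y * np.log(F + eps) + (1 - Y) * np.log(1 - F + eps))
    # Only the final-layer rows are tethered; the encoder is not penalized here.
    penalty = 0.5 * s * np.sum((Theta[0] - Theta[1]) ** 2)
    return ll - penalty
```

Note that both events share the same latent features, so the similarity penalty acts only on how each event weights those features.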




\subsection{Method Implementation}

We fit \sinabbr and \mulabbr by minimizing the negative penalized log-likelihood. Given the penalized log-likelihood of \sinabbr in Equation~\ref{eq:sin-ll} and of \mulabbr in Equation~\ref{eq:multi-ll}, the optimization problem is given in Equation~\ref{eq:opt-func}.

\begin{equation}\label{eq:opt-func}
\begin{gathered}
        \min_{\bm{\theta}, \bm{\phi}} \left[-\ \mathcal{L}^{(s)}(\bm{\theta}, \bm{\phi} | \mathcal{D}_n) + \textrm{Reg}(\bm{\theta}) \right],
\end{gathered}
\end{equation}
where $\bm{\phi}$ is omitted for \sinabbr. Reg$(\bm{\theta})$ denotes any additional regularization applied to the learned parameters. A general penalty term can be used alongside the similarity penalty to stabilize performance and further reduce overfitting.

The $L_2$ similarity penalty in Equations~\ref{eq:sin-ll} and \ref{eq:multi-ll} can be replaced with any measure of distance or similarity between two vectors; the appropriate choice may depend on the application. Section~\ref{sec:simulations} presents experimental results using the $L_2$ and $L_1$ norms of the difference $\bm{\theta}_1 - \bm{\theta}_2$, as well as the cosine similarity between the two vectors.
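
The three penalty choices can be written as interchangeable functions, as in the sketch below. The $L_2$ form follows Equations~\ref{eq:sin-ll} and \ref{eq:multi-ll} with the factor $s$ left out. The scaling of the $L_1$ penalty and the use of $1 - \cos(\bm{\theta}_1, \bm{\theta}_2)$ to turn cosine similarity into a penalty are assumptions made for this example, as are the function names.

```python
import numpy as np

def l2_penalty(theta1, theta2):
    """(1/2) * ||theta1 - theta2||_2^2, as in the penalized log-likelihood."""
    return 0.5 * np.sum((theta1 - theta2) ** 2)

def l1_penalty(theta1, theta2):
    """||theta1 - theta2||_1 (assumed scaling; no 1/2 factor)."""
    return np.sum(np.abs(theta1 - theta2))

def cosine_penalty(theta1, theta2, eps=1e-12):
    """1 - cosine similarity (assumed form): 0 when aligned, 2 when opposed."""
    denom = np.linalg.norm(theta1) * np.linalg.norm(theta2) + eps
    return 1.0 - (theta1 @ theta2) / denom
```

Unlike the $L_1$ and $L_2$ penalties, the cosine form depends only on the direction of the two vectors, so it allows the events to differ in overall effect magnitude without incurring a cost.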

Equation~\ref{eq:opt-func} can be solved with any applicable optimization method, such as stochastic gradient descent. The parameter $s$ can be tuned on a validation set or fixed based on pre-existing knowledge of the similarity between the events of interest.
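
As a rough sketch of this procedure for \sinabbr, the code below minimizes the objective in Equation~\ref{eq:opt-func} and selects $s$ from a candidate grid by held-out negative log-likelihood. Several simplifications are assumed: full-batch gradient descent is used in place of stochastic gradient descent, $\textrm{Reg}(\bm{\theta})$ is set to zero, and the step size \texttt{lr}, the iteration count \texttt{n\_iter}, the grid, and the function names are hypothetical choices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neg_log_lik(Theta, X, Y):
    """Unregularized negative log-likelihood. Theta: (2, p); Y: (n, 2)."""
    eps = 1e-12  # hypothetical guard against log(0)
    F = sigmoid(X @ Theta.T)
    return -np.sum(Y * np.log(F + eps) + (1 - Y) * np.log(1 - F + eps))

def fit_cet_lr(X, Y, s, lr=1e-3, n_iter=3000):
    """Full-batch gradient descent on -L^{(s)} with Reg(theta) = 0 (assumed)."""
    Theta = np.zeros((2, X.shape[1]))
    for _ in range(n_iter):
        F = sigmoid(X @ Theta.T)      # (n, 2) predicted probabilities
        grad = (F - Y).T @ X          # gradient of the negative log-likelihood
        diff = Theta[0] - Theta[1]
        grad[0] += s * diff           # d/d theta_1 of (s/2)||theta_1 - theta_2||^2
        grad[1] -= s * diff           # d/d theta_2 of the same term
        Theta -= lr * grad
    return Theta

def select_s(X_tr, Y_tr, X_val, Y_val, grid):
    """Return the s in grid with the lowest validation negative log-likelihood."""
    scores = {s: neg_log_lik(fit_cet_lr(X_tr, Y_tr, s), X_val, Y_val)
              for s in grid}
    return min(scores, key=scores.get), scores
```

A call such as \texttt{select\_s(X\_tr, Y\_tr, X\_val, Y\_val, [0.0, 1.0, 10.0])} returns the selected value together with the validation score of every candidate. The fixed step size must be small relative to the curvature of the objective, which grows with both the sample size and $s$.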