\section{Theoretical Results}\label{sec: theory}
In this section, we conduct a theoretical analysis of our proposed method. We first describe a two-step approach to \sinabbr and derive asymptotic properties of this estimator. This setup allows us to compare directly to unregularized and ridge LR (Section~\ref{sec:two-step-asym}). We also use this setup to illustrate the efficiency advantages of utilizing related outcomes that occur more frequently (Section~\ref{sec:fin-sample-eff}).

We proceed to derive the asymptotic properties of CET-LR. We establish that this estimator is asymptotically unbiased with a lower asymptotic mean-squared error than LR when the slope parameters of the outcome models are the same and the similarity penalty is greater than zero.

Finally, we discuss how, with additional assumptions on the latent features, these theoretical results can be extended to \mulabbr.

% Finally, focusing back on the case of a finite sample size, we present results suggesting how our informed logistic regression benefits in terms of efficiency by utilizing both events.

\subsection{Two-Step Approach}
We consider a two-step variation of LR that incorporates a penalty on the distance between the parameters being estimated and parameters previously estimated with LR for another outcome. It is akin to ridge LR, except that the estimated parameters are penalized by their distance from the parameters of a similar event rather than by their distance from zero.

We let $\hat{\bm{\theta}}^{(k)}_{2}$ denote the parameter estimates for the more common event, obtained from ridge LR with penalty parameter $k$. For simplicity, we omit the $k$ superscript from this estimate and let $\hat{\bm{\theta}}_{2} = \hat{\bm{\theta}}^{(k)}_{2}$.

We then estimate the parameters $\bm{\theta}_1$ of our rare event logistic model by maximizing the log-likelihood

\begin{equation}
    \begin{gathered}
        \mathcal{L}^{(s)}(\bm{\theta}_1 | \mathcal{D}_n, \hat{\bm{\theta}}_{2}) = \mathcal{L}(\bm{\theta}_1 | \mathcal{D}_n) - \frac{1}{2}s\|\bm{\theta}_1 - \hat{\bm{\theta}}_2 \|_2^2,
    \end{gathered}
\end{equation}
where $\mathcal{L}(\bm{\theta}_1 | \mathcal{D}_n)$ is the unregularized log-likelihood and $s$ is the similarity penalty parameter that controls the strength of the penalization term. We let $\hat{\bm{\theta}}^{(s)}_{1}$ denote this estimate but once again omit the $s$ superscript for the majority of this paper, letting $\hat{\bm{\theta}}_{1} = \hat{\bm{\theta}}^{(s)}_{1}$.\footnote{We mention this superscript notation for use in Section~\ref{sec:fin-sample-eff}.}


Like ridge LR, this two-step approach to \sinabbr introduces an $L_2$ penalty term into the log-likelihood. In doing so, it maintains the desirable properties of ridge regularization such as avoiding overfitting and handling multicollinearity. However, rather than pulling the coefficients of $\bm{\theta}_1$ towards zero, it pulls them towards the estimated coefficients of the more common event.
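
As a brief sketch of how the penalty acts, assume that $\mathcal{L}$ is differentiable in $\bm{\theta}_1$ and write $\nabla$ for the gradient with respect to $\bm{\theta}_1$ (notation introduced here for illustration only). The maximizer $\hat{\bm{\theta}}_1$ then satisfies the first-order condition
\begin{equation}
    \begin{gathered}
        \nabla\mathcal{L}(\hat{\bm{\theta}}_1 | \mathcal{D}_n) - s\left(\hat{\bm{\theta}}_1 - \hat{\bm{\theta}}_2\right) = \bm{0},
    \end{gathered}
\end{equation}
so the penalty contributes a restoring term proportional to the displacement of $\hat{\bm{\theta}}_1$ from $\hat{\bm{\theta}}_2$. Replacing $\hat{\bm{\theta}}_2$ with the zero vector recovers the corresponding condition for ridge LR.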

\subsection{Two-Step Asymptotic Properties}\label{sec:two-step-asym}

Theorem~\ref{thm:asymptotic} establishes the asymptotic bias and variance of the two-step \sinabbr approach.


\begin{restatable}[Two-Step \sinabbr Asymptotic Properties]{theorem}{twoasym}
\label{thm:asymptotic}
% \begin{theorem}\label{thm:asymptotic}[Two-Step \sinabbr Asymptotic Properties]
Let $\mathbbm{E}[\hat{\bm{\theta}}_2]$ and $\text{Var}[\hat{\bm{\theta}}_2]$ be the asymptotic expectation and variance of the estimate of $\bm{\theta}_2$. Then the maximizer $\hat{\bm{\theta}}_1$ of $\mathcal{L}^{(s)}(\bm{\theta}_1 | \mathcal{D}_n, \hat{\bm{\theta}}_2)$ has asymptotic bias
\begin{equation}\label{eq:asym-bias}
    \begin{gathered}
    \mathbbm{E}[\hat{\bm{\theta}}_1 - \bm{\theta}_1] =
    -s\left(\bm{\Omega}(\bm{\theta}_1) + s\mathbf{I}\right)^{-1}
    \left[\bm{\theta}_1 - \mathbbm{E}[\hat{\bm{\theta}}_2]\right]
    \end{gathered}
\end{equation}    
and asymptotic variance
\begin{equation}\label{eq:asym-var}
    \begin{gathered}
        \text{Var}[\hat{\bm{\theta}}_1] = \\
        \left(\bm{\Omega}(\bm{\theta}_1) + s\mathbf{I}\right)^{-1}
        \left(\bm{\Omega}(\bm{\theta}_1) + s^2 \text{Var}[\hat{\bm{\theta}}_2] \right)
        \left(\bm{\Omega}(\bm{\theta}_1) + s\mathbf{I}\right)^{-1}.
    \end{gathered}
\end{equation}
Here, $\bm{\Omega}(\cdot)$ is the negative of the Hessian matrix, $\bm{\theta}_1$ is the true parameter vector of event 1, and $\mathbf{I}$ is the $p\times p$ identity matrix.
% \end{theorem}
\end{restatable}

We observe from Equation~\ref{eq:asym-bias} in Theorem~\ref{thm:asymptotic} that, compared to ridge LR, two-step \sinabbr can decrease the asymptotic bias of the estimated parameter vector of event 1 ($\hat{\bm{\theta}}_1$) if its true parameter vector ($\bm{\theta}_1$) is closer to the asymptotic expectation of the estimate for event 2 ($\mathbbm{E}[\hat{\bm{\theta}}_2]$) than it is to the zero vector.\footnote{The asymptotic bias of $\hat{\bm{\theta}}_1$ estimated with ridge LR and penalty parameter $s$ is $\mathbbm{E}[\hat{\bm{\theta}}_1 - \bm{\theta}_1] = -s\left(\bm{\Omega}(\bm{\theta}_1) + s\mathbf{I}\right)^{-1}\bm{\theta}_1$.} In particular, the two-step \sinabbr estimate is asymptotically unbiased if the true parameters of event $1$ ($\bm{\theta}_1$) equal the asymptotic expectation of the estimated parameters of event $2$ ($\mathbbm{E}[\hat{\bm{\theta}}_2]$).
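
For intuition, consider the scalar case $p = 1$, an illustrative simplification that is not part of the theorem, and write $\Omega > 0$ for the scalar value of $\bm{\Omega}(\bm{\theta}_1)$. The asymptotic biases of two-step \sinabbr and ridge LR are then, respectively,
\begin{equation}
    \begin{gathered}
        -\frac{s}{\Omega + s}\left(\theta_1 - \mathbbm{E}[\hat{\theta}_2]\right)
        \quad\text{and}\quad
        -\frac{s}{\Omega + s}\,\theta_1,
    \end{gathered}
\end{equation}
so the two-step bias is smaller in magnitude exactly when $|\theta_1 - \mathbbm{E}[\hat{\theta}_2]| < |\theta_1|$. With the hypothetical values $\theta_1 = 1$, $\mathbbm{E}[\hat{\theta}_2] = 0.8$, and $\Omega = s = 1$, the two biases are $-0.1$ and $-0.5$.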

However, two-step \sinabbr does incur slightly higher variance from using the estimates of $\bm{\theta}_2$ in its regularization term.\footnote{The asymptotic variance of $\hat{\bm{\theta}}_1$ estimated with ridge LR and penalty parameter $s$ is 
$\text{Var}[\hat{\bm{\theta}}_1] = \left(\bm{\Omega}(\bm{\theta}_1) + s\mathbf{I}\right)^{-1}
\bm{\Omega}(\bm{\theta}_1)\left(\bm{\Omega}(\bm{\theta}_1) + s\mathbf{I}\right)^{-1}$.}
We quantify when this exchange of decreased bias for higher variance is beneficial by comparing the asymptotic mean-squared error (MSE) of the two-step \sinabbr estimator versus ridge LR in Theorem~\ref{thm:mse}.
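
The added variance can be made explicit by subtracting the ridge LR variance given in the footnote above from Equation~\ref{eq:asym-var}. Writing $\text{Var}_{\text{ridge}}[\hat{\bm{\theta}}_1]$ for the ridge LR variance (a label used only in this illustration), the difference is
\begin{equation}
    \begin{gathered}
        \text{Var}[\hat{\bm{\theta}}_1] - \text{Var}_{\text{ridge}}[\hat{\bm{\theta}}_1] = \\
        s^2\left(\bm{\Omega}(\bm{\theta}_1) + s\mathbf{I}\right)^{-1}
        \text{Var}[\hat{\bm{\theta}}_2]
        \left(\bm{\Omega}(\bm{\theta}_1) + s\mathbf{I}\right)^{-1},
    \end{gathered}
\end{equation}
which is positive semidefinite whenever $\text{Var}[\hat{\bm{\theta}}_2]$ is, and which vanishes as $\text{Var}[\hat{\bm{\theta}}_2]$ shrinks to zero.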

\begin{restatable}[Two-Step \sinabbr vs. Ridge LR MSE]{theorem}{twomse}
\label{thm:mse}
% \begin{theorem}\label{thm:mse}[Two-Step \sinabbr vs. Ridge LR MSE]
Let $\tilde{\bm{\theta}}_1$ be the ridge LR estimate of $\bm{\theta}_1$ with ridge penalty parameter $s$, and let $\hat{\bm{\theta}}_1$ be the two-step \sinabbr estimate with the same value $s$ for the similarity parameter, where $\hat{\bm{\theta}}_2$ is the estimate for event $2$ used in the penalty term. As in Theorem~\ref{thm:asymptotic}, let $\mathbbm{E}[\hat{\bm{\theta}}_2]$ be the asymptotic expectation of $\hat{\bm{\theta}}_2$.

We let $\bm{\theta}_1$ and $\bm{\theta}_2$ be the true parameter vectors for events 1 and 2 respectively, and assume that there exists an orthogonal matrix $\mathbf{P}$ such that $\bm{\Omega}(\bm{\theta}_1) = \mathbf{P}\mathbf{A}\mathbf{P}'$ and $\bm{\Omega}(\bm{\theta}_2) = \mathbf{P}\mathbf{B}\mathbf{P}'$ for diagonal matrices $\mathbf{A}$ and $\mathbf{B}$. 

We then let $\mathbf{a} = \mathbf{P}\bm{\theta}_1$ and $\mathbf{b} = \mathbf{P}\mathbbm{E}[\hat{\bm{\theta}}_2]$ be the projections of $\bm{\theta}_1$ and $\mathbbm{E}[\hat{\bm{\theta}}_2]$ onto the column space of $\mathbf{P}$.

Denoting $\text{MSE}$ as the asymptotic mean-squared error of an estimator, we find that
\begin{equation}\label{eq:mse-ineq}
    \begin{gathered}
    \text{MSE}\left(\hat{\bm{\theta}}_1\right) < \text{MSE}\left(\tilde{\bm{\theta}}_1\right)
    \end{gathered}
\end{equation}    
when 
\begin{equation}\label{eq:mse-diff}
    \begin{gathered}
        b_{j}\left(2a_{j} - b_{j}\right) > \frac{B_{j, j}}{(B_{j, j} + k)^2}
    \end{gathered}
\end{equation}
for all $j\in\{1, \dots, p\}$.

The above is a sufficient, but not necessary, condition. If we denote the left-hand side of Equation~\ref{eq:mse-diff} as $\eta_j$ and the right-hand side as $\beta_j$, and further let $\alpha_j = \frac{1}{(A_{j, j} + s)^2}$, a more relaxed condition sufficient to imply Equation~\ref{eq:mse-ineq} is that 
\begin{equation}\label{eq:mse-shorthand}
    \begin{gathered}
        \sum_{j=1}^{p}\alpha_j\eta_j
        >
        \sum_{j=1}^{p}\alpha_j\beta_j.
    \end{gathered}
\end{equation}

\end{restatable}

Theorem~\ref{thm:mse} establishes a relationship between the reduced bias and added variance of using estimated parameter values of a related event as a baseline for regularization. In essence, this theorem shows that the degree to which the parameter vector for event 1 is closer to the parameter vector for event 2 than it is to the zero vector must be enough to outweigh the added variance of estimating the parameters for event 2. In practice, this suggests that a tethering approach can be beneficial when a more common event's parameters can be estimated with low variance and are believed to be similar to the parameters for a rare event of interest. However, tethering to a more common event may not be a good idea if the variance of the common event's estimated parameters is large. We include the proof for Theorem~\ref{thm:mse} and expand further on its implications in Appendix~\ref{apdx:proofs}.
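
As a hedged sketch of where Equation~\ref{eq:mse-diff} comes from (the full argument is in the appendix), suppose that the asymptotic MSE is the squared norm of the asymptotic bias plus the trace of the asymptotic variance, and that $\text{Var}[\hat{\bm{\theta}}_2]$ takes the ridge form $\left(\bm{\Omega}(\bm{\theta}_2) + k\mathbf{I}\right)^{-1}\bm{\Omega}(\bm{\theta}_2)\left(\bm{\Omega}(\bm{\theta}_2) + k\mathbf{I}\right)^{-1}$. Both are assumptions of this illustration. Treating $a_j$ and $b_j$ as the coordinates of $\bm{\theta}_1$ and $\mathbbm{E}[\hat{\bm{\theta}}_2]$ in the shared eigenbasis, every term decouples across coordinates and
\begin{equation}
    \begin{gathered}
        \text{MSE}\left(\tilde{\bm{\theta}}_1\right) - \text{MSE}\left(\hat{\bm{\theta}}_1\right)
        = s^2\sum_{j=1}^{p}\alpha_j\left[a_j^2 - \left(a_j - b_j\right)^2 - \beta_j\right] \\
        = s^2\sum_{j=1}^{p}\alpha_j\left(\eta_j - \beta_j\right),
    \end{gathered}
\end{equation}
where we use $a_j^2 - (a_j - b_j)^2 = b_j(2a_j - b_j) = \eta_j$. The terms $A_{j,j}\alpha_j$ contributed by $\bm{\Omega}(\bm{\theta}_1)$ appear in both variances and cancel. Because $\alpha_j > 0$, the coordinatewise condition $\eta_j > \beta_j$ makes every summand positive, while Equation~\ref{eq:mse-shorthand} only requires the weighted sum to be positive.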

Formalizing just how close $\bm{\theta}_1$ needs to be to $\mathbbm{E}[\hat{\bm{\theta}}_{2}]$ is difficult because of the complex behavior of MLE solvers. Using intuition drawn from \citet{keskar2016large}, one can view the eigenvalues (i.e., the diagonal values of $\mathbf{A}$ and $\mathbf{B}$) as characterizing the sharpness of the minimizer along the corresponding eigenvectors in $\mathbf{P}$. In this case, for \sinabbr to improve upon ridge LR, $\mathbf{a}$ needs to be closer to $\mathbf{b}$ than to $\bm{0}$, particularly in the directions most important for loss minimization.

% \hl{In practice, we show in Section~\ref{sec:exp} that this results in SLR having only marginal gains over ridge RL when regression coefficients are normally distributed around zero with low variance. Whereas, as coefficient values move farther from zero SLR sees large gains over ridge RL.}

\subsection{Finite Sample Efficiency}\label{sec:fin-sample-eff}

It has long been understood that rare event prediction is hindered by the lack of positive samples. Such behavior can be understood by observing the asymptotic efficiency of maximum likelihood estimators for rare events. Whereas the unregularized MLE of LR typically converges at a rate of $n^{-\frac{1}{2}}$, \cite{wang2020logistic} showed that the convergence rate of $\hat{\bm{\theta}}_1$ is $O_p(n_1^{-\frac{1}{2}})$, where $n_1 = \sum_{i=1}^n y_{i, 1}$. To better understand the benefit of sharing information between rare events in finite samples, we decompose the two-step \sinabbr estimate into two components. To do so, we reintroduce the superscript $s$ into the estimate $\hat{\bm{\theta}}^{(s)}_1$ and let $\hat{\bm{\theta}}_1$ and $\hat{\bm{\theta}}_2$ denote the unregularized LR estimates of the parameter vectors. With this notation, we can write

\begin{equation}
    \begin{gathered}
        \hat{\bm{\theta}}^{(s)}_1 = 
        \left(\bm{\Omega}^{(s)}(\bm{\theta}_1)\right)^{-1}\bm{\Omega}(\bm{\theta}_1)\hat{\bm{\theta}}_1 +
        s \left(\bm{\Omega}^{(s)}(\bm{\theta}_1)\right)^{-1}\hat{\bm{\theta}}_2
    \end{gathered}
\end{equation}

where $\bm{\Omega}^{(s)}(\bm{\theta}_1) = \bm{\Omega}(\bm{\theta}_1) + s\mathbf{I}$.
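
One way to see this decomposition is the following sketch, which assumes a local quadratic approximation of the unregularized log-likelihood with curvature $\bm{\Omega}(\bm{\theta}_1)$ (an assumption of this illustration, with $\nabla$ denoting the gradient with respect to $\bm{\theta}_1$). Since the unregularized score vanishes at $\hat{\bm{\theta}}_1$, we have $\nabla\mathcal{L}(\bm{\theta} | \mathcal{D}_n) \approx -\bm{\Omega}(\bm{\theta}_1)(\bm{\theta} - \hat{\bm{\theta}}_1)$ near $\hat{\bm{\theta}}_1$. Substituting this into the penalized first-order condition $\nabla\mathcal{L}(\hat{\bm{\theta}}^{(s)}_1 | \mathcal{D}_n) - s(\hat{\bm{\theta}}^{(s)}_1 - \hat{\bm{\theta}}_2) = \bm{0}$ gives
\begin{equation}
    \begin{gathered}
        \left(\bm{\Omega}(\bm{\theta}_1) + s\mathbf{I}\right)\hat{\bm{\theta}}^{(s)}_1
        \approx
        \bm{\Omega}(\bm{\theta}_1)\hat{\bm{\theta}}_1 + s\,\hat{\bm{\theta}}_2,
    \end{gathered}
\end{equation}
and multiplying both sides by $\left(\bm{\Omega}^{(s)}(\bm{\theta}_1)\right)^{-1}$ yields the two components above.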


If we let $n_2 = \sum_{i=1}^n y_{i, 2}$ and assume that $n_2 > n_1$, then there is more information in $\mathcal{D}_n$ about the more common outcome, $y_{i,2}$, than the rarer outcome, $y_{i,1}$. In terms of convergence rates, the estimation error of $\hat{\bm{\theta}}_2$, which is $O_p(n_2^{-\frac{1}{2}})$, shrinks faster than that of $\hat{\bm{\theta}}_1$, which is $O_p(n_1^{-\frac{1}{2}})$. We further observe that as $s$ increases, the coefficient values of $\hat{\bm{\theta}}^{(s)}_1$ become more strongly composed of $\hat{\bm{\theta}}_2$ than $\hat{\bm{\theta}}_1$. From here, we see how, when we believe $\bm{\theta}_1$ and $\bm{\theta}_2$ to be similar, imposing a strong similarity regularization term can help with faster convergence and ultimately result in better performance on real-world datasets.
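
To make the role of $s$ concrete, note that the two matrix weights in the decomposition sum to the identity, since $\left(\bm{\Omega}^{(s)}(\bm{\theta}_1)\right)^{-1}\bm{\Omega}(\bm{\theta}_1) + s\left(\bm{\Omega}^{(s)}(\bm{\theta}_1)\right)^{-1} = \mathbf{I}$. In the scalar case $p = 1$, an illustrative simplification with $\Omega > 0$ denoting the scalar value of $\bm{\Omega}(\bm{\theta}_1)$, the decomposition becomes the convex combination
\begin{equation}
    \begin{gathered}
        \hat{\theta}^{(s)}_1 = w\,\hat{\theta}_1 + (1 - w)\,\hat{\theta}_2,
        \qquad
        w = \frac{\Omega}{\Omega + s},
    \end{gathered}
\end{equation}
so that $w \rightarrow 1$ as $s \rightarrow 0$, recovering the unregularized estimate for event 1, and $w \rightarrow 0$ as $s \rightarrow \infty$, recovering the estimate for event 2.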

\subsection{\sinabbr Asymptotic Properties}
We derive the asymptotic properties of the \sinabbr estimator in Appendix~\ref{apdx:asym-props-single}, which we use to prove Theorem~\ref{thm:single-step-mse}.


\begin{restatable}[\sinabbr Asymptotic MSE]{theorem}{sinmse}
\label{thm:single-step-mse}
% \begin{theorem}\label{thm:single-step-mse}[\sinabbr Asymptotic MSE]
If $\bm{\theta}_1 = \bm{\theta}_2$, then for any $s' > s \geq 0$ the asymptotic MSE of the maximum likelihood estimate of $\bm{\theta} = [\bm{\theta}_1, \bm{\theta}_2]$ is lower under the log-likelihood $\mathcal{L}^{(s')}(\bm{\theta} | \mathcal{D}_n)$ than under $\mathcal{L}^{(s)}(\bm{\theta} | \mathcal{D}_n)$.
\end{restatable}

Theorem~\ref{thm:single-step-mse} shows that when $\bm{\theta}_1 = \bm{\theta}_2$, \sinabbr is a better estimator than unregularized LR (i.e., $s=0$) and continues to improve as $s$ grows. In fact, as $s\rightarrow\infty$, the asymptotic MSE of \sinabbr for $\bm{\theta} = [\bm{\theta}_1, \bm{\theta}_2]$ approaches that of LR for just $\bm{\theta}_1$.\footnote{See the proof of Theorem~\ref{thm:single-step-mse} in Appendix~\ref{apdx:proofs} for the calculation of the asymptotic MSE of \sinabbr.}

\subsection{Extension to \mulabbr}

Theoretical analysis of feature learning in neural networks is notoriously challenging and beyond the scope of this work. However, we note that the preceding theoretical analyses apply not only in the linear case, but also in the case where the nonlinear features of the data-generating process (DGP) are faithfully recovered (up to permutation), as this assumption reduces learning to logistic regression on latent rather than raw features. Thus, understanding differences in the benefits of \sinabbr and \mulabbr depends on a theoretical account of feature learning in the MLL setting.

%but also in the random features model of \citet{mei2022generalization}, in which latent features are fixed after initialization, and only the second layer weights are learned.


