\section{Preliminaries}
\label{sec:prelim}

In this section, we formally define the notation, the logistic regression framework, and the distributed learning setup.

\subsection{Binary Classification and Logistic Regression}
We consider a binary classification problem. Let $\mathcal{D}$ be a distribution over feature-label pairs $(x, y)$, where $x \in \mathbb{R}^d$ is a vector of features and $y \in \{0, 1\}$ is the binary target label.

We model the conditional probability $P(y = 1 \mid x)$ using the logistic function $\sigma(z) = \frac{1}{1 + e^{-z}}$. A hypothesis is parameterized by a vector $\theta \in \mathbb{R}^d$, yielding the predictor:
\begin{equation}
    p^{(\theta)}(x) = \sigma(\theta^T x).
\end{equation}
The quality of a predictor is measured by the expected Binary Cross Entropy (BCE) loss:
\begin{equation}
    L(\theta) = -\mathbb{E}_{(x,y) \sim \mathcal{D}} [y \log p^{(\theta)}(x) + (1-y) \log(1 - p^{(\theta)}(x))].
\end{equation}
The global Maximum Likelihood Estimator (MLE), denoted by $p^* = p^{(\theta^*)}$, is the predictor whose parameters $\theta^* \in \arg\min_{\theta \in \mathbb{R}^d} L(\theta)$ minimize this loss using all $d$ features.
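
As a sanity check on these definitions, the loss can be rewritten directly in terms of the logit $z = \theta^T x$. Since $\log \sigma(z) = z - \log(1 + e^{z})$ and $\log(1 - \sigma(z)) = -\log(1 + e^{z})$, the per-example loss simplifies to
\[
    -\bigl[y \log \sigma(z) + (1-y) \log(1 - \sigma(z))\bigr] = \log(1 + e^{z}) - y z,
\]
which is convex in $z$ with derivative $\sigma(z) - y$. Assuming that differentiation and expectation can be exchanged (an assumption not stated above, which holds for example when $\mathbb{E}\|x\| < \infty$), the gradient of the expected loss is
\[
    \nabla L(\theta) = \mathbb{E}_{(x,y) \sim \mathcal{D}} \bigl[(p^{(\theta)}(x) - y)\, x\bigr],
\]
so, whenever a minimizer $\theta^*$ exists, it satisfies $\mathbb{E}_{(x,y) \sim \mathcal{D}}[(p^*(x) - y)\, x] = 0$.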

\subsection{Distributed Learning Setup}
We consider a distributed learning setting with a set of $N$ agents, $\mathcal{A} = \{A_1, \dots, A_N\}$. The agents are organized in a Directed Acyclic Graph (DAG), $G = (\mathcal{A}, E)$, where an edge $(A_j, A_i) \in E$ indicates that agent $A_i$ receives information from agent $A_j$. We sometimes also write $A_j \to A_i$ to denote this relationship. We denote the set of parents of agent $A_i$ as $\mathrm{Pa}(A_i) = \{A_j \mid (A_j, A_i) \in E\}$. The agents learn in an order consistent with a topological sort of the DAG, with ties in the topological ordering broken arbitrarily.

Let $[d] = \{1, 2, \dots, d\}$ be the set of indices for $d$ total features. Each agent $A_i \in \mathcal{A}$ is associated with a specific subset of these features, $S_i \subseteq [d]$. For any agent $A_i$, its local view of the features is $x_{S_i}$, which is the sub-vector of $x$ corresponding to the features indexed by $S_i$. The agent also receives its parents' logits.
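
To make the notation concrete, consider a purely illustrative instance (the specific graph and feature subsets are hypothetical and chosen only for exposition): $d = 3$ features and $N = 3$ agents with $E = \{(A_1, A_3), (A_2, A_3)\}$, $S_1 = \{1\}$, $S_2 = \{2\}$, and $S_3 = \{3\}$. Then
\[
    \mathrm{Pa}(A_1) = \mathrm{Pa}(A_2) = \emptyset, \qquad \mathrm{Pa}(A_3) = \{A_1, A_2\},
\]
and both $(A_1, A_2, A_3)$ and $(A_2, A_1, A_3)$ are valid learning orders. Agent $A_3$ observes $x_{S_3} = (x_3)$ together with the logits of $A_1$ and $A_2$, but never observes $x_1$ or $x_2$ directly.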

\subsection{Sequential Learning Protocol}
The agents learn models in a sequential manner. Unlike linear regression settings where agents minimize squared error, here each agent $A_i$ aims to train a model $p_i$ that minimizes its local BCE loss.

Each agent $A_i$ observes its local features $x_{S_i}$ and the set of outputs from its parents. To preserve the information geometry of the exponential family, agents communicate their logits (the input to the sigmoid function) rather than their final probabilities. Let $z_j$ be the logit output by parent $A_j$, such that the parent's prediction is $\hat{p}_j = \sigma(z_j)$.

The model $p_i$ for agent $A_i$ is a logistic function of its local features and the parents' logits:
\begin{equation}
    p_i(x) = \sigma(z_i(x)) \quad \text{where} \quad z_i(x) = w_i^T x_{S_i} + \sum_{j \,:\, A_j \in \mathrm{Pa}(A_i)} v_{ij} z_j(x).
\end{equation}
Here, $w_i \in \mathbb{R}^{|S_i|}$ (weights for local features) and $v_{ij} \in \mathbb{R}$ (weights for incoming logits) are learnable parameters. Agent $A_i$ chooses these parameters to minimize the expected BCE loss $L(p_i)$.
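
For intuition, consider a two-agent chain $A_1 \to A_2$ (a hypothetical example chosen for exposition). Agent $A_1$ has no parents, so $z_1(x) = w_1^T x_{S_1}$, and substituting into the model above gives
\[
    z_2(x) = w_2^T x_{S_2} + v_{21}\, w_1^T x_{S_1}.
\]
By the same substitution along any topological order, each $z_i$ is a linear function of the features held by $A_i$ and its ancestors. Moreover, if the parents' logits are treated as fixed inputs when $A_i$ trains (our reading of the sequential protocol) and differentiation can be exchanged with the expectation, any minimizer of agent $A_i$'s loss satisfies the first-order conditions
\[
    \mathbb{E}\bigl[(p_i(x) - y)\, x_{S_i}\bigr] = 0, \qquad \mathbb{E}\bigl[(p_i(x) - y)\, z_j(x)\bigr] = 0 \quad \text{for all } A_j \in \mathrm{Pa}(A_i).
\]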

The final output of the system is the prediction of the sink agent (or the last agent in the topological sort).

We use the notation $L(p)$, $L(\theta)$, and $L(z)$ interchangeably, where $\theta$ denotes the weight vector, $z(x) = \theta^T x$ is the corresponding logit, and $p(x) = \sigma(z(x))$ is the induced predictor.
