\section{Introduction}

The study of \textit{social learning in networks} addresses a fundamental question in distributed learning:
How do agents with partial and heterogeneous information aggregate their observations to form an accurate global belief?
This line of inquiry has a rich literature in economics and network science, commencing with the seminal work of DeGroot~\cite{degroot1974reaching}.
The DeGroot model conceptualizes learning as an iterative process of weighted averaging, where agents update their scalar beliefs based on the beliefs of their neighbors.
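In its standard form (the notation here is illustrative and is not used in the remainder of the paper), agent $i$ holds a belief $x_i^{(t)}$ at round $t$ and updates it as
\[
  x_i^{(t+1)} = \sum_{j} W_{ij}\, x_j^{(t)}, \qquad W_{ij} \ge 0, \quad \sum_{j} W_{ij} = 1,
\]
where $W_{ij}$ is the weight that agent $i$ places on the belief of agent $j$ and is nonzero only for neighbors of $i$ (possibly including $i$ itself).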
This heuristic approach was subsequently refined by Bayesian models of observational learning and information cascades, in which rational agents infer private signals from the actions of their predecessors in order to reach consensus or learn the true state.
For example, Banerjee~\cite{banerjee1992simple} studies herding behavior in sequential decision-making; Bikhchandani et al.~\cite{bikhchandani1992theory} formalize informational cascades as a mechanism for fads and fashions; Gale and Kariv~\cite{galekariv2003bayesian} analyze Bayesian learning dynamics over social networks; and Golub and Jackson~\cite{golub2010naive} study conditions under which naive averaging aggregates information.

While classical social learning focuses on the aggregation of scalar estimates for a single state variable, modern applications increasingly demand \textit{networked machine learning}, where agents collaboratively learn high-dimensional hypothesis functions.
In this setting, aggregation entails reconstructing a complex predictive relationship---such as a classifier mapping a high-dimensional feature vector to a label---dispersed across a network.
Recently, Kearns et al.~\cite{kearns2026networked} introduced a formal framework for this problem, embedding learning agents in a Directed Acyclic Graph (DAG).
In their protocol, agents observe a local subset of features and the predictions of their parents, training a model to minimize a local loss function.
For linear regression under squared error, they demonstrated that such a process allows agents to achieve excess loss competitive with a global learner having access to all features, with the network depth acting as the critical resource for aggregation.

\subsection{Our Contributions}
In this work, we focus on the classification variant of the model of~\cite{kearns2026networked}.
We analyze a sequential learning protocol where agents optimize logistic regression models using local features and incoming logits from their parents in a DAG.
Our main contributions are as follows.

\textbf{Upper Bounds.}
We analyze information aggregation in a network of logistic regression agents.
We show that if a path of length $D$ satisfies a coverage condition (namely, every contiguous block of $M$ agents collectively observes all features), then the excess loss of the final agent on the path scales as $O(M/\sqrt{D})$ (Theorem~\ref{thm:convergence}).
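As a rough sketch of the setting, under notation that we assume here purely for illustration (the formal definitions may differ in details such as intercepts or regularization), let agent $i$ on the path observe a feature subset $S_i \subseteq \{1,\dots,k\}$ of the $k$ features and receive the logit $z_{i-1}$ of its parent, with $z_0 \equiv 0$. A logistic regression agent then outputs
\[
  z_i(x) = w_i^{\top} x_{S_i} + \beta_i\, z_{i-1}(x), \qquad
  (w_i, \beta_i) \in \arg\min_{w,\beta}\; \mathrm{E}\Bigl[\ell\bigl(y,\, \sigma(w^{\top} x_{S_i} + \beta\, z_{i-1}(x))\bigr)\Bigr],
\]
where $\sigma$ is the sigmoid and $\ell$ is the logistic (cross-entropy) loss. In the same assumed notation, the coverage condition reads
\[
  \bigcup_{j=i}^{i+M-1} S_j = \{1,\dots,k\} \qquad \mbox{for every } 1 \le i \le D-M+1 .
\]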

At a high level, our proof extends the analysis of~\cite{kearns2026networked} from squared loss to logistic loss.
The squared-loss argument relies on an exact variance decomposition, which does not carry over to Binary Cross-Entropy (BCE).
Instead, we certify progress via a KL/Bregman-type characterization of loss differences (Lemma~\ref{lem:pythagorean}) together with a Pinsker-style link from KL progress to prediction error (Lemma~\ref{lem:kl-mse}).
A key technical input is an orthogonality condition for BCE residuals (Lemma~\ref{lem:orthogonality}).
Finally, a stability (pigeonhole) argument identifies a segment of the path where improvement saturates; this forces small residuals and yields a bound on the deviation from the global optimum.
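The following textbook facts, stated in notation we introduce only for illustration, indicate the flavor of these ingredients; they are not the statements of the lemmas cited above. Writing $\ell(y,p) = -y\log p - (1-y)\log(1-p)$ and $\eta(x) = \Pr[y = 1 \mid x]$, any predictor $p$ satisfies
\[
  \mathrm{E}\bigl[\ell(y, p(x))\bigr] - \mathrm{E}\bigl[\ell(y, \eta(x))\bigr]
  = \mathrm{E}\Bigl[\mathrm{KL}\bigl(\mathrm{Ber}(\eta(x)) \,\|\, \mathrm{Ber}(p(x))\bigr)\Bigr],
\]
and Pinsker's inequality for Bernoulli distributions gives
\[
  \mathrm{KL}\bigl(\mathrm{Ber}(p) \,\|\, \mathrm{Ber}(q)\bigr) \ge 2\,(p - q)^2 .
\]
Moreover, if $z(x) = \theta^{\top} u(x)$ minimizes the expected loss over $\theta$ for an input vector $u(x)$, then the first-order optimality condition is
\[
  \mathrm{E}\bigl[(y - \sigma(z(x)))\, u(x)\bigr] = 0,
\]
that is, the residual is uncorrelated with every input available to the agent.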

\textbf{Lower Bounds and Hard Instances.}
We complement the upper bound by exhibiting a hard instance for the sequential logit-passing protocol.
On this instance, early features are uninformative about the label in isolation and only become useful after sufficiently many passes through the feature cycle.
We prove that the excess loss is lower-bounded by $\Omega(k/D)$, where $k$ is the dimension of the feature space, showing that network depth is not merely sufficient but necessary in this framework.

\subsection{From Regression to Classification}
Extending theoretical guarantees of networked aggregation from linear regression to binary classification is non-trivial.
The analysis in~\cite{kearns2026networked} relies fundamentally on the geometry of squared loss, in particular the orthogonality of residuals and a Pythagorean variance decomposition, which together translate loss reduction directly into parameter-space convergence.
In contrast, binary classification via logistic regression and Binary Cross-Entropy (BCE) does not admit a comparable bias--variance decomposition.
The non-linearity of the sigmoid link introduces genuine geometric complications; in particular, linear aggregation in probability space is not equivalent to linear aggregation in feature space, motivating the architectural choice of passing logits rather than probabilities.
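A small worked example, included only as an illustration, makes the distinction concrete. Take two logits $z_1 = 0$ and $z_2 = \ln 9$, so that $\sigma(z_1) = 0.5$ and $\sigma(z_2) = 0.9$. Averaging in probability space and in logit space gives
\[
  \frac{1}{2}\,\sigma(z_1) + \frac{1}{2}\,\sigma(z_2) = 0.7,
  \qquad
  \sigma\Bigl(\frac{1}{2}\, z_1 + \frac{1}{2}\, z_2\Bigr) = \sigma(\ln 3) = 0.75 .
\]
Only the second quantity is again the sigmoid of a function that is linear in the underlying features whenever $z_1$ and $z_2$ are.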

Despite these challenges, classification remains the primary modality for distributed applications ranging from medical diagnosis~\cite{vepakomma2018split} to decentralized fraud detection.
Establishing guarantees for information aggregation under BCE is therefore a central theoretical goal.

More broadly, this regression-to-classification gap is not unique to our networked setting: across several areas, techniques and guarantees that are clean for \emph{squared-loss regression} require genuinely different tools once the target is \emph{probabilistic classification}.
We list three representative pairs.
First, in \emph{sketching/subspace-embedding} methods, least-squares regression admits sharp oblivious sketching guarantees~\cite{clarkson2013lowrank}, whereas logistic objectives require non-$L_2$ progress measures and different proof techniques~\cite{munteanu2021oblivious}.
Second, in \emph{second-order/curvature-aware} acceleration via sketching, iterative Hessian sketch methods are cleanly analyzed for constrained least squares~\cite{pilanci2016ihs}, while Newton-sketch extensions for regularized ERM objectives (including logistic regression) require controlling curvature that depends on current predictions~\cite{pilanci2017newton}.
Third, in \emph{distribution-free predictive inference}, split conformal prediction yields regression intervals~\cite{lei2018conformal}, whereas classification requires set-valued prediction regions and different uncertainty objects~\cite{angelopoulos2021gentle}.
Taken together, these examples support the message most relevant to our paper: moving from regression to classification typically replaces Euclidean residual decompositions with KL/Bregman-flavored notions of progress, which becomes unavoidable when the information flowing through the network is itself a low-bandwidth probabilistic prediction.

\subsection{Related Work}
The most directly related body of work comes from \emph{Vertical Federated Learning (VFL)}, where different parties hold disjoint feature columns for the same aligned examples and collaborate to train a joint predictor.
The standard VFL setting is inherently \emph{interactive}: the survey of~\cite{yang2019federated} formalizes the feature-partitioned regime and highlights that most practical protocols rely on repeated message exchanges (e.g., gradients, activations, or protected sufficient statistics) to optimize a shared objective.
This view is reflected in deployed platforms such as FATE~\cite{liu2021fate}, which operationalize multi-round VFL pipelines and make explicit the practical tension between accuracy, privacy protection, and communication cost.
A particularly relevant algorithmic family is \emph{vertical} gradient-boosted tree training: SecureBoost~\cite{cheng2021secureboost} and later high-performance variants such as SecureBoost+~\cite{fan2024secureboostplus} show that strong predictors can be trained over vertically split features, but only by repeatedly coordinating split decisions through exchanging split-related information.
From the perspective of our setting, these works provide concrete evidence that \emph{feature partitioning} alone already induces a strong communication bottleneck; our work asks what can be guaranteed when the interaction budget is pushed much closer to its limit.
A complementary architectural line is \emph{split learning}~\cite{vepakomma2018split}, which avoids sharing raw features by cutting a neural network across parties and communicating intermediate activations/gradients.
While the mechanism is different from VFL-by-gradients or VFL-by-trees, it again emphasizes the same friction point that is central to our paper: learning can succeed under information-flow constraints, but the protocol must carefully manage what is transmitted and how many rounds are available.
Finally, recent surveys~\cite{ye2025vflreview, ye2025vflstructured} synthesize these lines and stress that communication (both message size and number of rounds) remains a dominant practical limitation in VFL, alongside privacy leakage and the statistical dependence patterns induced by feature/label partitioning; this framing closely matches the communication-centric viewpoint taken in our analysis.
