---
layout: distill
title: "Collaborative QA using Interacting LLMs: Impact of Network Structure, Node Capability and Distributed Data"
description: "We model and analyze how a network of interacting LLMs performs collaborative question-answering to estimate a ground truth given a distributed set of documents. We combine mean-field dynamics from network science with the randomized utility model from economics to construct a tractable generative model, prove fixed-point properties, and empirically study networks of 100 LLMs across three datasets."
htmlwidgets: true

# Anonymize when submitting
authors:
  - name: Anonymous

# Camera-ready: uncomment and fill in
# authors:
#   - name: Adit Jain
#     affiliations:
#       name: CollinearAI / Cornell University
#   - name: Vikram Krishnamurthy
#     url: "https://vikram.ece.cornell.edu/"
#     affiliations:
#       name: Cornell University
#   - name: Yiming Zhang
#     affiliations:
#       name: Cornell University

bibliography: submission.bib

toc:
  - name: Why Study Networks of LLMs?
  - name: The Interaction Protocol
  - name: Mean-Field Dynamics
    subsections:
    - name: The Population State ODE
    - name: The Randomized Utility Model
  - name: Theoretical Guarantees
  - name: Interactive Explorer
  - name: Empirical Insights
    subsections:
    - name: Test-Time Compute Scales Truth
    - name: Data Placement Matters
    - name: Network Topology Shapes Extremity
    - name: Stronger Hubs Spread Truth
  - name: Robustness and Sensitivity
  - name: Conclusion
---

## Why Study Networks of LLMs?

By mid-2025, roughly half of all internet articles involved LLM assistance <d-cite key="paredes2025articles"></d-cite>. This LLM-generated text feeds back into the training and context of other LLMs, creating implicit networks of interacting models. Meanwhile, explicit multi-LLM architectures — for programming, research, and question-answering — are becoming standard <d-cite key="mitchener2025kosmos"></d-cite><d-cite key="li2023camel"></d-cite><d-cite key="yao2023tree"></d-cite>.

A critical question emerges: **when LLMs interact in a network, does truthful information win out, or does hallucination spread?**

This paper provides both a theoretical framework and empirical answers. We model a directed network of $N$ LLMs performing **collaborative question-answering (CQA)**, where each LLM holds a partial, possibly misleading context, and must estimate a ground truth by combining its private evidence with its neighbors' opinions. We find that the answer depends on a precise interplay between network topology, node capability, and data placement.

<d-footnote>We define hallucination as reporting a state estimate that is not substantiated by the context and is not the ground truth.</d-footnote>


## The Interaction Protocol

Consider a directed graph $G = (V, E)$ of $N$ LLMs. Each node $i$ has a private observation $y_i$ (its context), a set of in-neighbors $\mathcal{N}(i)$ whose opinions it receives, and a current state estimate $\hat{x}_i \in \bar{\mathcal{X}} = \mathcal{X} \cup \{\text{Don't Know}\}$.

At each interaction round, nodes receive their neighbors' previous estimates alongside their own context and a control signal $u$ (representing test-time compute, communication budget, or model capability). Each LLM then produces an updated estimate with a rationale. As analysts with knowledge of the ground truth, we classify each LLM into one of three **latent states**: Truthful (T), Hallucinating (H), or Don't Know (D).
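The analyst-side labeling step can be sketched in a few lines. This is a minimal illustration, not the authors' evaluation code: the function names are hypothetical, answers are compared by exact string match, and every estimate that is neither the ground truth nor "Don't Know" is counted as hallucinating, which is coarser than the definition in the footnote above.

```python
from collections import Counter

DONT_KNOW = "Don't Know"

def latent_state(estimate: str, ground_truth: str) -> str:
    """Analyst-side label: 'T' (truthful), 'D' (don't know) or 'H' (hallucinating).

    Assumption: exact string match against the ground truth.
    """
    if estimate == DONT_KNOW:
        return "D"
    return "T" if estimate == ground_truth else "H"

def population_state(estimates, ground_truth):
    """Fraction of nodes in each latent state after one interaction round."""
    counts = Counter(latent_state(e, ground_truth) for e in estimates)
    n = len(estimates)
    return {state: counts.get(state, 0) / n for state in ("T", "H", "D")}

# Four hypothetical nodes answering a question whose ground truth is "Paris".
rho = population_state(["Paris", "Rome", DONT_KNOW, "Paris"], "Paris")
print(rho)
```

The resulting dictionary is the empirical population state that the mean-field model in the next section tracks over rounds.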

The interactive visualization below demonstrates this process. Press **Run** to watch information diffuse through the network. Try switching the data placement to see how placing correct context on influential vs. peripheral nodes changes the outcome.

<figure style="text-align: center; margin: 20px 0;">
    <iframe 
        src="{{ 'assets/html/submission/network_diffusion.html' | relative_url }}"
        width="100%" 
        height="520" 
        style="border: none; overflow: hidden; border-radius: 8px; box-shadow: 0 2px 8px rgba(0,0,0,0.08);"
        title="Interactive Network Diffusion">
    </iframe>
    <figcaption><strong>Figure 1.</strong> Interactive simulation of information diffusion in an LLM network. Each node is colored by its latent state: <span style="color:#27ae60;font-weight:bold">green</span> (truthful), <span style="color:#e74c3c;font-weight:bold">red</span> (hallucinating), <span style="color:#95a5a6;font-weight:bold">gray</span> (don't know). Node size reflects degree centrality. Transition probabilities follow a simplified RUM. Try different topologies and data placements to see how they affect convergence.</figcaption>
</figure>


## Mean-Field Dynamics

Tracking the joint state of $N$ interacting LLMs requires a state space of size $|\bar{\mathcal{X}}|^N$ — intractable for $N \geq 100$. We instead model the *population-level* behavior using **mean-field dynamics (MFD)** <d-cite key="yang2018meanfield"></d-cite><d-cite key="jackson2013diffusion"></d-cite>, which reduces the problem to a set of ordinary differential equations governing the evolution of the proportion of LLMs in each state.

### The Population State ODE

Let $\boldsymbol{\rho}^l \in \Delta$ denote the state distribution of LLMs with in-degree $l$, where $\Delta$ is the probability simplex over the $|\bar{\mathcal{X}}|$ states. The population state evolves according to:

$$
\frac{d\boldsymbol{\rho}^l}{dt} = F^l(Q, \boldsymbol{\rho}, u) \, \boldsymbol{\rho}^l
$$

where $F^l$ is a rate matrix encoding transition rates for degree-$l$ nodes. Its off-diagonal entries, written $G^l_{z_1 z_2}$ for $z_1 \neq z_2$, aggregate over all possible neighbor configurations:

$$
G^l_{z_1 z_2}(Q, \boldsymbol{\rho}, u) = \sum_{\mathbf{n} : |\mathbf{n}|=l} \kappa_{z_1, z_2}(u, l, \mathbf{n}) \binom{l}{\mathbf{n}} \boldsymbol{\theta}(Q, \boldsymbol{\rho})^{\mathbf{n}}
$$

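Here $\mathbf{n} = (n_z)_{z \in \bar{\mathcal{X}}}$ counts the in-neighbors in each state, $\binom{l}{\mathbf{n}}$ is the multinomial coefficient, and $\boldsymbol{\theta}^{\mathbf{n}} = \prod_z \theta_z^{n_z}$. The transition kernel $\kappa_{z_1, z_2}$ is the probability of switching from state $z_1$ to $z_2$ given a specific neighbor configuration, and $\theta_z(Q, \boldsymbol{\rho})$ is the probability that a random edge originates from a node in state $z$: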

$$
\theta_z(Q, \boldsymbol{\rho}) = \frac{\sum_{l} \sum_{m} m \, Q(l, m) \, \rho^l_z}{\sum_{l} \sum_{m} m \, Q(l, m)}
$$

Here $Q(l, m)$ is the joint distribution of in-degree $l$ and out-degree $m$. The expression is *size-biased by out-degree*: high-out-degree nodes contribute disproportionately to $\theta_z$, which is why network topology matters so much for information diffusion.
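To make the size-biasing concrete, the sketch below evaluates $\theta_z$ for a toy network. It assumes $Q$ is stored as an array indexed by (in-degree, out-degree) and $\boldsymbol{\rho}$ as an array indexed by (in-degree, state); the function name and the toy numbers are illustrative and not taken from the paper.

```python
import numpy as np

def edge_state_probability(Q, rho):
    """Probability that a random edge originates from a node in each state.

    Q[l, m]   : joint probability of in-degree l and out-degree m
    rho[l, z] : fraction of in-degree-l nodes in state z
    """
    Q = np.asarray(Q, dtype=float)
    rho = np.asarray(rho, dtype=float)
    out_degrees = np.arange(Q.shape[1], dtype=float)
    edge_weight = Q @ out_degrees            # sum_m m * Q(l, m), one entry per l
    return edge_weight @ rho / edge_weight.sum()

# Toy network: half the nodes have (in, out) = (1, 1) and all hallucinate,
# the other half have (in, out) = (2, 3) and are all truthful.
Q = np.zeros((3, 4))
Q[1, 1] = 0.5
Q[2, 3] = 0.5
rho = np.array([
    [0.0, 0.0, 1.0],   # l = 0: carries no probability mass here
    [0.0, 1.0, 0.0],   # l = 1: all H
    [1.0, 0.0, 0.0],   # l = 2: all T
])                     # columns ordered (T, H, D)

theta = edge_state_probability(Q, rho)
print(theta)  # approximately (0.75, 0.25, 0.0)
```

Truthful nodes make up 50% of this population but are the source of 75% of the edges, because they hold three of every four outgoing links.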


### The Randomized Utility Model

The key modeling decision is how to specify the transition kernel $\kappa_{z_1, z_2}$. Rather than using black-box estimates, we adopt the **Randomized Utility Model (RUM)** <d-cite key="mcfadden1974measurement"></d-cite>, treating each LLM as a rational agent maximizing a noisy utility:

$$
\bar{r}_z(u, l, \mathbf{n}, w, z_1) = \boldsymbol{\theta}^\top \boldsymbol{\phi}_z(u, l, \mathbf{n}, w, z_1) + \varepsilon_z, \quad \varepsilon_z \overset{\text{i.i.d.}}{\sim} \text{Gumbel}(0,1)
$$

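Maximizing $\bar{r}_z$ over $z \in \bar{\mathcal{X}}$ with independent Gumbel noise yields the **multinomial logit** transition rule in the deterministic utilities:

$$
\kappa_{z_1, z_2}(u, l, \mathbf{n}, w) = \frac{\exp\bigl(\boldsymbol{\theta}^\top \boldsymbol{\phi}_{z_2}(u, l, \mathbf{n}, w, z_1)\bigr)}{\sum_{z \in \bar{\mathcal{X}}} \exp\bigl(\boldsymbol{\theta}^\top \boldsymbol{\phi}_z(u, l, \mathbf{n}, w, z_1)\bigr)}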
$$

This choice is not arbitrary. It provides:
- **Interpretability**: the feature weights $\boldsymbol{\theta}$ (the RUM coefficient vector, not the edge probability $\theta_z$ defined earlier) reveal how much an LLM weighs neighbor consensus versus private context.
- **Analytical tractability**: the logit form satisfies smoothness properties needed for our fixed-point results.
- **Efficient estimation**: parameters are recovered via standard logistic regression.
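The link between Gumbel-perturbed utilities and the logit rule can be checked numerically. The sketch below is illustrative only: the three features (neighbor share, inertia, incentive toward T), the weight values, and the omission of the private-context argument $w$ are assumptions made for brevity, not the feature map used in the paper.

```python
import numpy as np

STATES = ("T", "H", "D")

def features(z, u, l, n, z1):
    """Hypothetical phi_z: share of neighbors in z, inertia, incentive toward T."""
    share = n[STATES.index(z)] / l if l > 0 else 0.0
    return np.array([share, float(z == z1), u * float(z == "T")])

def utilities(weights, u, l, n, z1):
    """Deterministic part of the utility for each candidate next state."""
    return np.array([weights @ features(z, u, l, n, z1) for z in STATES])

def logit_kernel(weights, u, l, n, z1):
    """Transition probabilities out of z1: softmax over deterministic utilities."""
    v = utilities(weights, u, l, n, z1)
    p = np.exp(v - v.max())                  # subtract max for numerical stability
    return p / p.sum()

def gumbel_max_choice(weights, u, l, n, z1, rng):
    """Sample the next state by maximizing utility plus Gumbel(0, 1) noise."""
    noisy = utilities(weights, u, l, n, z1) + rng.gumbel(size=len(STATES))
    return int(np.argmax(noisy))

weights = np.array([2.0, 1.0, 1.0])          # assumed values, for illustration
config = dict(u=0.5, l=4, n=(3, 1, 0), z1="H")
p = logit_kernel(weights, **config)

rng = np.random.default_rng(0)
draws = [gumbel_max_choice(weights, rng=rng, **config) for _ in range(20000)]
freq = np.bincount(draws, minlength=len(STATES)) / len(draws)
print(dict(zip(STATES, p.round(3))), dict(zip(STATES, freq.round(3))))
```

The empirical choice frequencies from the noisy maximization should agree with the closed-form softmax up to sampling error, which is the property that lets the weights be fitted by logistic regression.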


## Theoretical Guarantees

For a two-state simplification ($\mathcal{X} = \{T, H\}$), we establish the following result under mild assumptions on the utility functions (monotone social influence, smoothness, non-degenerate switching, and positive incentive direction):

> **Theorem 1** (Fixed point and comparative statics). *Define $A_l(\theta; u) = \mathbb{E}[\kappa_{H,T}(u, l, M)]$ and $B_l(\theta; u) = \mathbb{E}[\kappa_{T,H}(u, l, M)]$ where $M \sim \text{Bin}(l, \theta)$. The edge-weighted map $\Phi(\theta; u, Q) = \frac{\sum_{l,m} m\, Q(l,m)\, \rho_l(\theta; u)}{\sum_{l,m} m\, Q(l,m)}$ satisfies:*
>
> *(i) $\Phi$ is continuous and non-decreasing, hence a fixed point $\theta^\star$ exists.*
>
> *(ii) Under a contraction condition, $\theta^\star$ is unique and globally asymptotically stable.*
>
> *(iii) The fixed point $\theta^\star(u)$ is non-decreasing in the incentive $u$: any control that raises the likelihood of truth in either state increases the equilibrium truth level, with effects amplified through high-out-degree nodes.*

Part (iii) has a direct practical implication: under the theorem's assumptions, **investing in test-time compute or stronger base models never lowers the equilibrium truth level**, and the benefit is amplified when high-influence nodes receive the investment.
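The fixed point and its dependence on $u$ can be explored numerically with a small sketch. Several pieces here are assumptions rather than the paper's construction: $\rho_l$ is taken to be the stationary truthful share $A_l / (A_l + B_l)$ of a two-state chain with switching probabilities $A_l$ and $B_l$, the kernels are a hypothetical two-state logit with social weight `beta` and an `inertia` term, and the degree distribution `Q` is a toy example.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def stationary_truth(l, theta, u, beta=2.0, inertia=1.0):
    """rho_l = A_l / (A_l + B_l), with M ~ Bin(l, theta) truthful in-neighbors."""
    A = B = 0.0
    for M in range(l + 1):
        pmf = math.comb(l, M) * theta**M * (1 - theta) ** (l - M)
        gap = beta * (2 * M / l - 1) + u     # utility gap in favor of T
        A += pmf * sigmoid(gap - inertia)    # expected H -> T probability
        B += pmf * sigmoid(-gap - inertia)   # expected T -> H probability
    return A / (A + B)

def Phi(theta, u, Q):
    """Edge-weighted map; Q maps (in-degree l, out-degree m) to a probability."""
    num = sum(m * q * stationary_truth(l, theta, u) for (l, m), q in Q.items())
    den = sum(m * q for (l, m), q in Q.items())
    return num / den

def fixed_point(u, Q, theta0=0.5, tol=1e-10, max_iter=10_000):
    """Iterate theta <- Phi(theta) until successive values differ by less than tol."""
    theta = theta0
    for _ in range(max_iter):
        new_theta = Phi(theta, u, Q)
        if abs(new_theta - theta) < tol:
            return new_theta
        theta = new_theta
    return theta

Q = {(2, 1): 0.6, (3, 2): 0.3, (5, 8): 0.1}  # toy joint degree distribution
for u in (0.0, 0.5, 1.0):
    print(u, round(fixed_point(u, Q), 4))
```

With these toy kernels the computed equilibrium rises with $u$, matching the direction stated in part (iii). Plain iteration is used because $\Phi$ is non-decreasing here; when the contraction condition of part (ii) fails, the limit can depend on `theta0`.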


## Interactive Explorer

The visualization below lets you explore the MFD predictions. Adjust the power-law exponent $\gamma$, the initial proportion of truthful nodes $\rho_T(0)$, the incentive $u$, and the social influence strength to see how the population state trajectories change. Observe how the fixed point $\rho^\star_T$ responds to each parameter; in particular, raising $u$ should never lower it, which is the monotonicity stated in Theorem 1(iii).

<figure style="text-align: center; margin: 20px 0;">
    <iframe 
        src="{{ 'assets/html/submission/mfd_explorer.html' | relative_url }}"
        width="100%" 
        height="440" 
        style="border: none; overflow: hidden; border-radius: 8px; box-shadow: 0 2px 8px rgba(0,0,0,0.08);"
        title="MFD Population State Explorer">
    </iframe>
    <figcaption><strong>Figure 2.</strong> Interactive mean-field dynamics explorer. The plot shows the evolution of population states $\rho_T$ (truthful, green), $\rho_H$ (hallucinating, red), and $\rho_D$ (don't know, gray dashed) over 30 interaction rounds. Adjust parameters to see how the fixed point shifts; in this explorer, higher incentive $u$ and higher initial truth raise $\rho^\star_T$, and the dependence on $u$ is the monotonicity of Theorem 1(iii).</figcaption>
</figure>


## Empirical Insights

We validate the theoretical predictions on networks of **100 LLMs** (primarily LLaMA-3.1-8B) across three collaborative QA datasets: **Fiction** (30 Project Gutenberg books), **Knowledge Cutoff** (Wikipedia edits made after the LLaMA-3 knowledge cutoff), and **Event** (100 news articles from Reuters, CNN, and BBC in April 2025). In all experiments, 35% of nodes receive correct context; the rest receive incomplete, incorrect, or empty context.

To check the MFD model itself, we fit the ODE to the first 150 interactions and predict the remaining trajectory, obtaining correlations $\geq 0.89$ and KL divergence $\leq 0.05$ between the predicted and observed population states.
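The two validation metrics can be computed along the following lines. This is a sketch under assumptions: the text does not specify the direction of the KL divergence or how it is aggregated over time, so the code uses KL(observed || predicted) averaged over rounds, and the trajectories below are made-up numbers rather than experimental results.

```python
import numpy as np

def trajectory_correlation(observed, predicted):
    """Pearson correlation per state between two (rounds, states) trajectories."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return np.array([
        np.corrcoef(observed[:, z], predicted[:, z])[0, 1]
        for z in range(observed.shape[1])
    ])

def mean_kl(observed, predicted, eps=1e-12):
    """Average KL(observed_t || predicted_t) over rounds, in nats."""
    p = np.clip(np.asarray(observed, dtype=float), eps, None)
    q = np.clip(np.asarray(predicted, dtype=float), eps, None)
    p = p / p.sum(axis=1, keepdims=True)     # renormalize after clipping
    q = q / q.sum(axis=1, keepdims=True)
    return float(np.mean(np.sum(p * np.log(p / q), axis=1)))

# Hypothetical held-out rounds; columns are (rho_T, rho_H, rho_D).
observed = [[0.30, 0.50, 0.20], [0.40, 0.45, 0.15], [0.50, 0.40, 0.10]]
predicted = [[0.32, 0.49, 0.19], [0.41, 0.44, 0.15], [0.48, 0.41, 0.11]]

print(trajectory_correlation(observed, predicted))
print(mean_kl(observed, predicted))
```

A state whose proportion is constant over the held-out window has undefined correlation, so such columns would need separate handling.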




### Test-Time Compute Scales Truth

Both communication overhead (measured in tokens) and deliberation rounds monotonically increase the truthful proportion $\rho_T$, as Theorem 1(iii) predicts, with **diminishing returns** that are consistent with the saturating logit transition rule. Increasing from answer-only to 100-token communication raises $\rho_T$ from 0.54 to 0.70 on the fiction dataset. Adding deliberation rounds (chain-of-thought + self-critique) yields further gains, and the effect holds even in heterogeneous networks where 20% of nodes run a stronger closed-source model.

<!-- {% include figure.html path="assets/img/submission/deliberation_rounds.png" class="img-fluid rounded" style="max-width:95%;height:auto;" caption="<strong>Figure 3.</strong> Evolution of population state ρ for different deliberation rounds across three datasets. Higher deliberation consistently increases ρ_T." %} -->


### Data Placement Matters

Placing the correct context on **high-influence nodes** (those with high degree centrality) substantially improves $\rho_T$ across all topologies — chain, power-law, and tree networks. In contrast, assigning correct data to peripheral nodes has minimal effect. This confirms that the edge-weighting by out-degree in $\Phi(\theta; u, Q)$ is not just a mathematical convenience but reflects real information dynamics.

<!-- {% include figure.html path="assets/img/submission/context_placement.png" class="img-fluid rounded" style="max-width:95%;height:auto;" caption="<strong>Figure 4.</strong> Context placement on influential vs. peripheral nodes across chain, power-law, and tree topologies." %} -->


### Network Topology Shapes Extremity

The power-law exponent $\gamma$ of the degree distribution controls how *extreme* outcomes are. **Lower $\gamma$ (more hub-dominated networks) produces more polarized results**: either very high or very low $\rho_T$, depending on whether hubs have correct or incorrect context. Higher $\gamma$ yields more moderate, predictable outcomes. This has direct design implications: if robustness is desired, use higher power-law exponents.
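One way to see why lower $\gamma$ concentrates influence is to compute the share of edges that originate from hubs. The sketch assumes a truncated power law $P(m) \propto m^{-\gamma}$ on out-degrees $1, \dots, 50$ and calls any node with out-degree at least 10 a hub; both cutoffs are arbitrary choices for illustration.

```python
import numpy as np

def hub_edge_share(gamma, max_degree=50, hub_threshold=10):
    """Share of edges originating from nodes with out-degree >= hub_threshold."""
    degrees = np.arange(1, max_degree + 1, dtype=float)
    pmf = degrees ** (-gamma)                # truncated power law, unnormalized
    pmf /= pmf.sum()
    edge_mass = degrees * pmf                # size-biased weight of each degree
    return edge_mass[degrees >= hub_threshold].sum() / edge_mass.sum()

for gamma in (2.0, 2.5, 3.0):
    print(gamma, round(hub_edge_share(gamma), 3))
```

As $\gamma$ decreases, hubs account for a larger share of the edge-weighted average, so whatever state the hubs are in weighs more heavily on every node's neighborhood.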

<!-- {% include figure.html path="assets/img/submission/power_law_exponent.png" class="img-fluid rounded" style="max-width:95%;height:auto;" caption="<strong>Figure 5.</strong> Histogram of final ρ_T across 100 questions for different power-law constants. Lower γ produces more extreme outcomes." %} -->


### Stronger Hubs Spread Truth

In heterogeneous networks mixing 3B and 8B parameter models, placing the **stronger model at high-centrality positions** significantly improves convergence to truth. The effect is consistent across all three datasets, suggesting that model capability and network position interact multiplicatively.

<!-- {% include figure.html path="assets/img/submission/model_heterogeneity.png" class="img-fluid rounded" style="max-width:95%;height:auto;" caption="<strong>Figure 6.</strong> Placing stronger LLMs (8B vs. 3B) at influential positions improves ρ_T across all datasets." %} -->


## Robustness and Sensitivity

We evaluate sensitivity to **linguistic perturbation** of the questions — applying lexical paraphrase, syntactic re-framing, and indirect formulation (8 combinations from the power set). Across 10 questions and 5 runs each, the population state $\rho_T$ is largely **robust to question framing**, with most perturbations producing less than 5% variation. This contrasts with the much larger effects of data placement and network structure, suggesting that the network's collective behavior is governed more by structural factors than surface-level linguistic variation.
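The eight conditions are the subsets of the three perturbation types. The enumeration below assumes the empty subset, the unperturbed question, is counted as one of the eight; the labels are taken from the text and the function name is illustrative.

```python
from itertools import combinations

PERTURBATIONS = ("lexical paraphrase", "syntactic re-framing", "indirect formulation")

def perturbation_conditions(perturbations=PERTURBATIONS):
    """All subsets of the perturbations, from none applied to all applied."""
    return [
        subset
        for k in range(len(perturbations) + 1)
        for subset in combinations(perturbations, k)
    ]

conditions = perturbation_conditions()
for condition in conditions:
    print(condition if condition else "(unperturbed)")
```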

These findings generalize across model families: experiments on Phi-4, Ministral-3 8B, Gemma-3 4B, and Gemma-3 12B confirm the same qualitative trends — truthful proportion increases with incentive, with diminishing returns.


## Conclusion

Networks of LLMs are becoming ubiquitous, yet their emergent behavior — particularly the spread of hallucination — remains poorly understood. This work provides both a tractable theoretical model (MFD + RUM) and systematic empirical evidence for how truth propagation depends on:

1. **Compute investment**: more test-time compute monotonically raises the truthful equilibrium.
2. **Data placement**: correct context on influential nodes is disproportionately valuable.
3. **Network design**: the power-law exponent controls outcome extremity; higher exponents yield more robust systems.
4. **Model capability at hubs**: placing stronger models at high-centrality positions amplifies truth propagation.

These insights provide actionable guidance for practitioners designing multi-LLM systems, and the MFD+RUM framework opens avenues for control-theoretic approaches to managing information flow in LLM networks.

Code and datasets are publicly available. The original PDF version of this paper, with full proofs and additional experiments, is available on OpenReview <d-cite key="jain2025preferential"></d-cite>.
