LLM Features Can Hurt GNNs: Concatenation Interference on Homophilous Graph Benchmarks

TMLR Paper9200 Authors

25 May 2026 (modified: 09 Jun 2026). Under review for TMLR. License: CC BY 4.0

Abstract: Adding LLM-generated node features to graph neural networks is widely reported to improve accuracy on standard benchmarks. We document a contrasting observation: when the LLM features are introduced through pure input concatenation, rather than joint training, distillation, or prompt-conditioning, they can systematically degrade accuracy on the same homophilous benchmarks where end-to-end LLM pipelines succeed. Under an MLP backbone with the Planetoid public split and BoW original features ($F_\text{orig}$), concatenating SBERT-encoded GPT-4o-mini TAPE features changes PubMed test accuracy by $-17.0 \pm 0.3$ pp and Cora by $-4.3 \pm 0.6$ pp, while CiteSeer's $-0.6 \pm 0.8$ pp is within seed noise. The drop attenuates as we relax each condition (GCN / GCNII / GAT backbones, random splits, smaller encoders), and reverses on medium-homophily WikiCS ($+4.4$ pp) and ogbn-arxiv ($+11.7$ pp). To predict when concatenation helps versus hurts, we report a simple measure of LLM-alone discriminability, $\Delta_\text{sig}$. Across 9 datasets, $\Delta_\text{sig}$ correlates with the concat cost more strongly than homophily does at the point estimate ($r^2 = 0.38$ vs. $0.06$; with $N=9$, the bootstrap CIs overlap). The bootstrap-best change-point is $\tau = 13.8$ pp (95% CI $[0, 13.8]$), and the rule "$\Delta_\text{sig} \leq \tau$ predicts non-positive concat cost" classifies 7/9 datasets correctly. Because only 60% of bootstrap samples place $\tau$ inside $[5, 30]$ pp, we treat $\Delta_\text{sig}$ as an interpretive lens for the helping vs. hurting regimes rather than a precision pre-A/B filter. A dim-controlled ablation on PubMed places the LLM-feature drop between same-source PCA ($-2.3$ pp) and same-dim Gaussian noise ($-37.3$ pp), ruling out dimensionality and weight-decay artifacts. Nine PubMed configurations (seven training sizes $\times$ two encoder dimensions) fit a power-law profile $|\Delta_\text{concat}| \propto (\sqrt{d_l / n})^{1.31}$ with $r^2 = 0.97$ (PubMed-internal; Cora and CiteSeer have different slopes). The $\sqrt{d_l/n}$ profile and the $\Delta_\text{sig}$ threshold jointly describe a two-axis surface; the low-$\Delta_\text{sig}$, small-$n$ corner is exactly where the headline $-17$ pp PubMed deficit appears. In the low-$\Delta_\text{sig}$ regime, the most effective remediation is to drop the LLM channel entirely: the $F_\text{orig}$-only baseline strictly dominates every learned cheap fix at $p \approx 0.008$. A learnable scalar gate closes 89% of the raw-concat gap and is a useful second-line option when downstream pipelines structurally require $F_\text{LLM}$. The findings do not contradict the aggregate accuracy gains reported for end-to-end LLM pipelines such as TAPE and GLEM; they identify the specific design choice (pure concatenation) under which the sign flips.
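
To make the power-law profile concrete, the following is a minimal sketch of how an exponent such as the reported $1.31$ could be estimated, assuming an ordinary least-squares fit in log-log space. The abstract does not state the fitting procedure, so that choice is an assumption. The helper name `fit_power_law`, the encoder dimension `d_l = 384`, the training sizes, and the prefactor `5.0` are all hypothetical, and the data are synthetic and noise-free, not the paper's measurements.

```python
import math


def fit_power_law(x, y):
    """Fit y = c * x**k by least squares on (log x, log y).

    Returns (k, c, r2), where r2 is the squared correlation in log space.
    Hypothetical helper; the paper's actual fitting code is not shown.
    """
    lx = [math.log(v) for v in x]
    ly = [math.log(v) for v in y]
    n = len(lx)
    mx = sum(lx) / n
    my = sum(ly) / n
    sxx = sum((a - mx) ** 2 for a in lx)
    syy = sum((b - my) ** 2 for b in ly)
    sxy = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    k = sxy / sxx                      # slope in log space = exponent
    c = math.exp(my - k * mx)          # intercept in log space = log prefactor
    r2 = (sxy * sxy) / (sxx * syy) if syy > 0 else 1.0
    return k, c, r2


# Synthetic illustration only (assumed values, not results from the paper).
d_l = 384                              # hypothetical encoder dimension
train_sizes = [60, 120, 240, 480, 960]  # hypothetical training-set sizes n
x = [math.sqrt(d_l / n) for n in train_sizes]

true_k, true_c = 1.31, 5.0             # exponent from the abstract; prefactor arbitrary
y = [true_c * v ** true_k for v in x]  # stands in for |Delta_concat| in pp

k, c, r2 = fit_power_law(x, y)
print(f"exponent k = {k:.2f}, prefactor c = {c:.2f}, r^2 = {r2:.2f}")
```

On real measurements the points would scatter around the line, and the $r^2$ value would quantify how well a single exponent describes them. The abstract's note that Cora and CiteSeer have different slopes corresponds to fitting each dataset separately.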

Submission Type: Regular submission (no more than 12 pages of main content)

Assigned Action Editor: ~Chuxu_Zhang2

Submission Number: 9200
