Understanding Graph Self-Supervised Pre-training under Distribution Shifts: A Scaling Law Perspective
Keywords: Graph Self-Supervised Pre-training, Scaling Law
Abstract: Scaling laws have played a fundamental role in the development of foundation models for NLP and vision, but their applicability to large-scale pre-trained graph models remains unclear, particularly under the distribution shifts intrinsic to graph data. In this work, we systematically investigate how model capacity and data scale affect downstream performance in graph pre-training under distribution shifts. To disentangle how such shifts impact scaling, we construct synthetic benchmarks based on contextual stochastic block models, with precise control over both structural and feature-level shifts between the pre-training and test graphs. Our initial experiments on GCN, a standard Graph Neural Network (GNN) baseline, reveal a striking asymmetry: increasing model capacity consistently improves performance, while increasing data size often degrades it, even under mild shifts. We show that this degradation is not inevitable: properly configuring the pre-training model with deeper, wider, and transformer-based architectures enables favorable data scaling even under distribution shift. As data scales, graph transformer models achieve gains of up to +9\% over GCN, a result that holds on both synthetic and real-world graph domain adaptation tasks. To explain this phenomenon, we develop a theoretical framework based on Fisher separability and Wasserstein domain divergence, which formally characterizes how distribution shifts affect representation transferability. Our results highlight architecture- and shift-aware strategies as the key to unlocking scalable graph model pre-training.
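The abstract's synthetic benchmark rests on contextual stochastic block models (CSBMs) with controllable structural and feature-level shifts. Below is a minimal sketch of how such a benchmark might be generated, assuming a standard two-community CSBM with Gaussian node features; the function name `sample_csbm`, its parameters, and the specific shift values are illustrative assumptions, not the paper's actual construction.

```python
import numpy as np

def sample_csbm(n, p_intra, p_inter, mu, sigma, rng):
    """Sample a two-community contextual SBM (illustrative sketch).

    Nodes are split evenly into two communities. Edges appear
    independently with probability p_intra within a community and
    p_inter across communities. Node features are Gaussian with
    community-dependent means +mu / -mu and shared std sigma.
    """
    labels = np.repeat([0, 1], n // 2)
    # Symmetric adjacency matrix without self-loops.
    same = labels[:, None] == labels[None, :]
    probs = np.where(same, p_intra, p_inter)
    upper = np.triu(rng.random((n, n)) < probs, k=1)
    adj = (upper | upper.T).astype(np.int8)
    # Features: community 0 centered at +mu, community 1 at -mu.
    means = np.where(labels[:, None] == 0, mu, -mu)
    feats = means + sigma * rng.standard_normal((n, mu.shape[0]))
    return adj, feats, labels

rng = np.random.default_rng(0)
mu = 0.5 * np.ones(16)
# Pre-training graph.
adj_src, x_src, y_src = sample_csbm(1000, 0.05, 0.01, mu, 1.0, rng)
# Shifted test graph: weaker community structure (structural shift)
# and shrunken feature means (feature-level shift). The shift
# magnitudes here are arbitrary placeholders.
adj_tgt, x_tgt, y_tgt = sample_csbm(1000, 0.03, 0.02, 0.8 * mu, 1.0, rng)
```

Separating the edge-probability knobs (p_intra, p_inter) from the feature-mean knob (mu) is what gives independent control over structural and feature-level shifts between the two graphs.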
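The theoretical framework invokes a Wasserstein domain divergence between pre-training and test representations. The paper's exact divergence is not reproduced here; as a rough, hypothetical proxy one could average one-dimensional Wasserstein-1 distances over embedding coordinates, as in the sketch below (`mean_w1` is an assumed helper, not the authors' measure).

```python
import numpy as np
from scipy.stats import wasserstein_distance

def mean_w1(emb_src, emb_tgt):
    """Per-coordinate Wasserstein-1 distance averaged over dimensions.

    A crude one-dimensional proxy for domain divergence between two
    embedding matrices of shape (n_samples, dim); not the paper's
    exact divergence.
    """
    return float(np.mean([
        wasserstein_distance(emb_src[:, d], emb_tgt[:, d])
        for d in range(emb_src.shape[1])
    ]))

# e.g., applied to the raw CSBM features from the sketch above:
# shift = mean_w1(x_src, x_tgt)
```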
Primary Area: learning on graphs and other geometries & topologies
Submission Number: 14567