Understanding Graph Self-Supervised Pre-training under Distribution Shifts: A Scaling Law Perspective
Keywords: Graph Self-Supervised Pre-training, Scaling Law
Abstract: Scaling laws have played a fundamental role in the development of foundation models for NLP and vision, but their applicability to large-scale pre-trained graph models remains unclear, particularly under the distribution shifts intrinsic to graph data. In this work, we systematically investigate how model capacity and data scale affect downstream performance in graph pre-training under distribution shifts. To disentangle how such shifts impact scaling, we construct synthetic benchmarks based on contextual stochastic block models, with precise control over both structural and feature-level shifts between the pre-training and test graphs. Our initial experiments on GCN, a standard Graph Neural Network (GNN) baseline, reveal a striking asymmetry: increasing model capacity consistently improves performance, while increasing data size often degrades it, even under mild shifts. We show that this degradation is not inevitable: properly configuring the pre-training model with deeper, wider, and transformer-based architectures enables favorable data scaling even under distribution shift. As data scales, graph transformer models achieve gains of up to +9\% over GCN, a result that holds on both synthetic and real-world graph domain adaptation tasks. To explain this phenomenon, we develop a theoretical framework based on Fisher separability and Wasserstein domain divergence, which formally characterizes how distribution shifts affect representation transferability. Our results highlight architecture- and shift-aware strategies as the key to unlocking scalable graph model pre-training.
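The abstract's synthetic benchmark rests on contextual stochastic block models (CSBMs) with controllable structural and feature-level shifts. Below is a minimal sketch of how such a benchmark might be generated, assuming a standard two-community CSBM with Gaussian node features; the function name `sample_csbm`, its parameters, and the specific shift values are illustrative assumptions, not the paper's actual construction.

```python
import numpy as np

def sample_csbm(n, p_intra, p_inter, mu, sigma, rng):
    """Sample a two-community contextual SBM (illustrative sketch).

    Nodes are split evenly into two communities. Edges appear
    independently with probability p_intra within a community and
    p_inter across communities. Node features are Gaussian with
    community-dependent means +mu / -mu and shared std sigma.
    """
    labels = np.repeat([0, 1], n // 2)
    # Symmetric adjacency matrix without self-loops.
    same = labels[:, None] == labels[None, :]
    probs = np.where(same, p_intra, p_inter)
    upper = np.triu(rng.random((n, n)) < probs, k=1)
    adj = (upper | upper.T).astype(np.int8)
    # Features: community 0 centered at +mu, community 1 at -mu.
    means = np.where(labels[:, None] == 0, mu, -mu)
    feats = means + sigma * rng.standard_normal((n, mu.shape[0]))
    return adj, feats, labels

rng = np.random.default_rng(0)
mu = 0.5 * np.ones(16)
# Pre-training graph.
adj_src, x_src, y_src = sample_csbm(1000, 0.05, 0.01, mu, 1.0, rng)
# Shifted test graph: weaker community structure (structural shift)
# and shrunken feature means (feature-level shift). The shift
# magnitudes here are arbitrary placeholders.
adj_tgt, x_tgt, y_tgt = sample_csbm(1000, 0.03, 0.02, 0.8 * mu, 1.0, rng)
```

Separating the edge-probability knobs (p_intra, p_inter) from the feature-mean knob (mu) is what gives independent control over structural and feature-level shifts between the two graphs.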
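The theoretical framework invokes a Wasserstein domain divergence between pre-training and test representations. The paper's exact divergence is not reproduced here; as a rough, hypothetical proxy one could average one-dimensional Wasserstein-1 distances over embedding coordinates, as in the sketch below (`mean_w1` is an assumed helper, not the authors' measure).

```python
import numpy as np
from scipy.stats import wasserstein_distance

def mean_w1(emb_src, emb_tgt):
    """Per-coordinate Wasserstein-1 distance averaged over dimensions.

    A crude one-dimensional proxy for domain divergence between two
    embedding matrices of shape (n_samples, dim); not the paper's
    exact divergence.
    """
    return float(np.mean([
        wasserstein_distance(emb_src[:, d], emb_tgt[:, d])
        for d in range(emb_src.shape[1])
    ]))

# e.g., applied to the raw CSBM features from the sketch above:
# shift = mean_w1(x_src, x_tgt)
```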
Primary Area: learning on graphs and other geometries & topologies
Submission Number: 14567