What Improves the Generalization of Graph Transformer? A Theoretical Dive into Self-attention and Positional Encoding

22 Sept 2023 (modified: 11 Feb 2024) · Submitted to ICLR 2024
Primary Area: learning theory
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: Graph Transformer, deep learning theory, generalization analysis, optimization, Graph neural network, Transformer
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
TL;DR: This paper provides a theoretical understanding of shallow Graph Transformers, including a learning and generalization analysis with sample complexity. We characterize the mechanism by which self-attention and positional encoding enhance generalization.
Abstract: Graph Transformers, which incorporate self-attention and positional encoding, have recently emerged as a powerful architecture for various graph learning tasks. Despite their impressive performance, the complex non-convex interactions across layers and the recursive graph structure have made it challenging to establish a theoretical foundation for learning and generalization. This study introduces the first theoretical investigation of a shallow Graph Transformer for semi-supervised node classification, comprising a self-attention layer with relative positional encoding and a two-layer perceptron. Focusing on a graph data model with discriminative nodes that determine node labels and non-discriminative nodes that are class-irrelevant, we characterize the sample complexity required to achieve zero generalization error by training with stochastic gradient descent (SGD). Our theoretical findings suggest that a larger fraction of discriminative nodes, a more clear-cut vote among discriminative nodes, a smaller fraction of erroneous labels, and smaller errors in the initial model and node patterns all improve generalization. Furthermore, we demonstrate that self-attention and positional encoding enhance generalization by sparsifying the attention map and promoting the core neighborhood during training, which explains the superior feature representation of Graph Transformers. Our theoretical results are supported by empirical experiments on synthetic and real-world benchmarks.
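To make the analyzed architecture concrete, below is a minimal PyTorch sketch of a shallow Graph Transformer of the kind the abstract describes: a single self-attention layer with a relative positional encoding term, followed by a two-layer perceptron, trained with SGD on labeled nodes only. All names (e.g., `ShallowGraphTransformer`, `rel_pe`) and the toy positional encoding are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (assumed, not the authors' code): one self-attention layer
# with an additive relative positional encoding, followed by a two-layer MLP,
# for semi-supervised node classification trained with SGD.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ShallowGraphTransformer(nn.Module):
    def __init__(self, in_dim, hidden_dim, num_classes):
        super().__init__()
        # Query/key/value projections for the single self-attention layer.
        self.q = nn.Linear(in_dim, hidden_dim, bias=False)
        self.k = nn.Linear(in_dim, hidden_dim, bias=False)
        self.v = nn.Linear(in_dim, hidden_dim, bias=False)
        self.scale = hidden_dim ** 0.5
        # Two-layer perceptron head producing per-node class logits.
        self.mlp = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, x, rel_pe):
        # x: (N, in_dim) node features; rel_pe: (N, N) relative positional
        # encoding, e.g. derived from the normalized adjacency matrix.
        scores = self.q(x) @ self.k(x).t() / self.scale
        attn = F.softmax(scores + rel_pe, dim=-1)   # attention map over nodes
        h = attn @ self.v(x)                        # aggregated node features
        return self.mlp(h)                          # (N, num_classes) logits


# Usage sketch: SGD on the labeled subset of nodes (semi-supervised setting).
if __name__ == "__main__":
    N, d = 50, 16
    x = torch.randn(N, d)
    adj = (torch.rand(N, N) < 0.1).float()
    rel_pe = adj / adj.sum(-1, keepdim=True).clamp(min=1)  # toy relative PE
    labels = torch.randint(0, 2, (N,))
    train_mask = torch.rand(N) < 0.3  # only a fraction of nodes are labeled

    model = ShallowGraphTransformer(d, 32, 2)
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    for _ in range(100):
        opt.zero_grad()
        logits = model(x, rel_pe)
        loss = F.cross_entropy(logits[train_mask], labels[train_mask])
        loss.backward()
        opt.step()
```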
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
Supplementary Material: zip
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 6089