Low-Width Approximations and Sparsification for Scaling Graph Transformers

Published: 28 Oct 2023, Last Modified: 21 Dec 2023, NeurIPS 2023 GLFrontiers Workshop Poster
Keywords: Graph Transformers, Sparse Transformers, Sparsification, Self-Attention, GNN, Graphs, Geometric Deep Learning
TL;DR: We use a small-width network to estimate the attention scores, which we then use to sparsify the graph before training a larger network on the resulting sparse structure.
Abstract: Graph Transformers have shown excellent results on a diverse set of datasets. However, memory limitations prohibit these models from scaling to larger graphs. With standard single-GPU setups, even training on medium-sized graphs is impossible for most Graph Transformers. While the $\mathcal{O}(nd^2+n^2d)$ complexity of each layer can be reduced to $\mathcal{O}((n+m)d+nd^2)$ using sparse attention models such as Exphormer for graphs with $n$ nodes and $m$ edges, these models are still infeasible to train on small-memory devices even for medium-sized datasets. Here, we propose to sparsify the Exphormer model even further, by using a small ``pilot'' network to estimate attention scores along the graph edges, then training a larger model using only the $\mathcal{O}(n)$ edges deemed important by the small network. We show empirically that attention scores from smaller networks provide a good estimate of the attention scores in larger networks, and that this process can yield a large-width sparse model nearly as good as the large-width non-sparse model.
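The pipeline described in the abstract can be sketched as follows: score each edge with a small-width attention module, keep roughly $\mathcal{O}(n)$ of the highest-scoring edges, and hand the sparsified edge set to a wider model. This is a minimal illustration, not the authors' implementation; the class `TinyEdgeAttention`, the helper `sparsify`, and the `edges_per_node` budget are hypothetical choices, and pilot training is omitted.

```python
# Minimal sketch (not the authors' code): estimate per-edge attention with a
# small-width "pilot" model, keep the top-k edges, then train a wider model on
# the sparsified edge set. All names and hyperparameters are hypothetical.
import torch
import torch.nn as nn


class TinyEdgeAttention(nn.Module):
    """Small-width network that scores each edge (u, v) via dot-product attention."""

    def __init__(self, in_dim, width=16):
        super().__init__()
        self.q = nn.Linear(in_dim, width)
        self.k = nn.Linear(in_dim, width)

    def forward(self, x, edge_index):
        src, dst = edge_index  # edge_index: (2, m) tensor of node indices
        # Unnormalized attention score per edge: <q(x_dst), k(x_src)> / sqrt(width)
        scores = (self.q(x[dst]) * self.k(x[src])).sum(-1) / self.q.out_features ** 0.5
        return scores


def sparsify(edge_index, scores, n, edges_per_node=4):
    """Keep roughly O(n) edges with the highest pilot attention scores."""
    budget = min(edges_per_node * n, scores.numel())
    keep = torch.topk(scores, budget).indices
    return edge_index[:, keep]


# Toy usage on random data.
n, m, d = 1000, 20000, 32
x = torch.randn(n, d)
edge_index = torch.randint(0, n, (2, m))

pilot = TinyEdgeAttention(d, width=16)
# (In practice the pilot would first be trained on the task; omitted here.)
with torch.no_grad():
    edge_scores = pilot(x, edge_index)

sparse_edges = sparsify(edge_index, edge_scores, n, edges_per_node=4)
print(sparse_edges.shape)  # (2, ~4n): edges retained for the large-width model
```

The wider model then trains only on `sparse_edges`, so its per-layer attention cost scales with the retained $\mathcal{O}(n)$ edges rather than the original $m$.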
Submission Number: 90