Keywords: Graph Transformers, Sparse Transformers, Sparsification, Self-Attention, GNN, Graphs, Geometric Deep Learning
TL;DR: We use a small-width network to estimate the attention scores, which we then use to sparsify the graph before training a larger network on the resulting sparse structure.
Abstract: Graph Transformers have shown excellent results on a diverse set of datasets. However, memory limitations prevent these models from scaling to larger graphs. With standard single-GPU setups, even training on medium-sized graphs is impossible for most Graph Transformers. While the $\mathcal{O}(nd^2+n^2d)$ complexity of each layer can be reduced to $\mathcal{O}((n+m)d+nd^2)$ using sparse attention models such as Exphormer for graphs with $n$ nodes and $m$ edges, these models are still infeasible to train on small-memory devices even for medium-sized datasets. Here, we propose to sparsify the Exphormer model even further,
by using a small ``pilot'' network to estimate attention scores along the graph edges,
then training a larger model using only $\mathcal{O}(n)$ edges deemed important by the small network.
We show empirically that attention scores from smaller networks provide a good estimate of the attention scores in larger networks,
and that this process can yield a large-width sparse model nearly as good as the large-width non-sparse model.
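The following is a minimal sketch (not the authors' code) of the two-stage pipeline described above: a small "pilot" attention network scores the existing graph edges, only the top-scoring incoming edges of each node are kept (so $\mathcal{O}(n)$ edges survive), and a wider model would then be trained on the sparsified edge set. All class and function names here are hypothetical illustrations.

```python
import torch
import torch.nn as nn

class PilotAttention(nn.Module):
    """Small-width, single-head attention that only scores existing edges."""
    def __init__(self, in_dim, hidden_dim=16):
        super().__init__()
        self.q = nn.Linear(in_dim, hidden_dim)
        self.k = nn.Linear(in_dim, hidden_dim)

    def edge_scores(self, x, edge_index):
        # edge_index: (2, m) tensor of (source, target) node indices
        src, dst = edge_index
        q, k = self.q(x), self.k(x)
        # Unnormalized attention score for each edge (target attends to source)
        return (q[dst] * k[src]).sum(dim=-1) / q.size(-1) ** 0.5

def sparsify_edges(scores, edge_index, num_nodes, keep_per_node=4):
    """Keep only the highest-scoring incoming edges of each node: O(n) edges total."""
    keep = torch.zeros(scores.numel(), dtype=torch.bool)
    dst = edge_index[1]
    for v in range(num_nodes):
        idx = (dst == v).nonzero(as_tuple=True)[0]
        if idx.numel() == 0:
            continue
        top = idx[scores[idx].topk(min(keep_per_node, idx.numel())).indices]
        keep[top] = True
    return edge_index[:, keep]

# Toy usage with random features and edges, for illustration only:
n, m, d = 100, 1000, 32
x = torch.randn(n, d)
edge_index = torch.randint(0, n, (2, m))

pilot = PilotAttention(d)
# ... (the pilot network would first be trained on the downstream task; omitted) ...
with torch.no_grad():
    scores = pilot.edge_scores(x, edge_index)
sparse_edge_index = sparsify_edges(scores, edge_index, n, keep_per_node=4)
# A larger-width Graph Transformer is then trained on `sparse_edge_index`.
```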
Submission Number: 90