GraTeD-MLP: Efficient Node Classification via Graph Transformer Distillation to MLP

Published: 16 Nov 2024, Last Modified: 26 Nov 2024, LoG 2024 Poster, CC BY 4.0
Keywords: Graph Transformers, Knowledge Distillation, Node Classification
TL;DR: We propose a novel framework to effectively distill the attention learned by a graph transformer into a scalable MLP model
Abstract: Graph Transformers (GTs) like NAGphormer have shown impressive performance by encoding a graph's structural information together with node features. However, their self-attention and complex architectures require high computation and memory, hindering their deployment. We therefore propose a novel framework called Graph Transformer Distillation to Multi-Layer Perceptron (GraTeD-MLP). GraTeD-MLP leverages knowledge distillation (KD) and a novel decomposition of the attentional representation to distill the learned representations from the teacher GT to a student MLP. During distillation, we incorporate a gated MLP architecture in which two branches learn the decomposed attentional representation for a node while a third predicts node embeddings. Encoding the attentional representation mitigates the MLP's over-reliance on node features, enabling robust performance even in inductive settings. Empirical results demonstrate that GraTeD-MLP achieves significantly faster inference than the teacher GT model, with speed-ups ranging from 20× to 40×, and up to 25% better performance than a vanilla MLP. Furthermore, we empirically show that GraTeD-MLP outperforms other GNN distillation methods on seven datasets in both inductive and transductive settings.
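The abstract describes a gated MLP student with two branches for the decomposed attentional representation and a third branch for node embeddings. Since the paper's exact equations are not given here, the following is only a minimal, hypothetical PyTorch sketch of such a three-branch gated student; the module names, dimensions, gating scheme, and recombination of the two attention branches are all assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn


class GraTeDMLPStudent(nn.Module):
    """Hypothetical sketch of a three-branch gated MLP student.

    Two branches approximate the decomposed attentional representation
    distilled from the GT teacher; a third predicts node embeddings.
    The gating and recombination below are illustrative assumptions.
    """

    def __init__(self, in_dim: int, hidden_dim: int, num_classes: int):
        super().__init__()
        # Two branches for the decomposed attentional representation.
        self.attn_branch_a = nn.Sequential(
            nn.Linear(in_dim, hidden_dim), nn.ReLU(), nn.Linear(hidden_dim, hidden_dim)
        )
        self.attn_branch_b = nn.Sequential(
            nn.Linear(in_dim, hidden_dim), nn.ReLU(), nn.Linear(hidden_dim, hidden_dim)
        )
        # Third branch predicting the node embedding from raw features.
        self.embed_branch = nn.Sequential(
            nn.Linear(in_dim, hidden_dim), nn.ReLU(), nn.Linear(hidden_dim, hidden_dim)
        )
        # Gate mixing the attentional and embedding paths (assumed design).
        self.gate = nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.Sigmoid())
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, x: torch.Tensor):
        a = self.attn_branch_a(x)
        b = self.attn_branch_b(x)
        h = self.embed_branch(x)
        attn_repr = a * b                    # assumed recombination of the decomposition
        g = self.gate(x)
        fused = g * attn_repr + (1 - g) * h  # gated fusion of the two paths
        logits = self.classifier(fused)
        # Branch outputs can be matched to the teacher's decomposed attention
        # and embeddings with distillation losses (e.g., MSE plus KL on logits).
        return logits, (a, b, h)


# Usage: at inference only node features are needed, so prediction runs at MLP speed.
model = GraTeDMLPStudent(in_dim=128, hidden_dim=64, num_classes=7)
logits, _ = model(torch.randn(1000, 128))
```

Because the student consumes only node features and the distilled attentional signal, it avoids neighborhood aggregation at inference time, which is where the reported 20×–40× speed-up over the teacher GT would come from.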
Submission Type: Full paper proceedings track submission (max 9 main pages).
Submission Number: 96