ViTeGNN: Towards Versatile Inference of Temporal Graph Neural Networks on FPGA

Published: 01 Jan 2025 · Last Modified: 20 May 2025 · IEEE Trans. Parallel Distributed Syst. 2025 · CC BY-SA 4.0
Abstract: Temporal Graph Neural Networks (TGNNs) are powerful models that capture temporal, structural, and contextual information on temporal graphs, outperforming other methods in many high-impact downstream tasks. However, achieving high-performance TGNN inference in production environments is challenging because TGNN models suffer from high computation complexity and an intrinsic temporal data dependency that hinders data parallelism. In addition, real-world TGNN applications have differing latency and throughput requirements. This work presents ViTeGNN, a versatile inference solution for memory-based TGNNs on FPGAs. ViTeGNN performs algorithm-model-architecture co-design to meet the latency and throughput requirements of real-world TGNN applications. Besides the vanilla inference mode ViTeGNN-bal, which updates the embeddings of nodes as they interact with others, we propose ViTeGNN-lat and ViTeGNN-thpt, optimized for latency and throughput, respectively. Our model optimizations include a lightweight method for computing attention scores and a related temporal neighbor pruning strategy that reduces computation and memory accesses; these are holistically coupled with key hardware optimizations tailored to the FPGA. We propose a novel hardware module to execute the complex neighbor update process efficiently. To keep accuracy close to that of the original model, the simplified models are trained using the knowledge distillation technique. We propose a unified hardware design that supports all three inference modes without FPGA reconfiguration. Enabled by our flexible hardware architecture, we further propose ViTeGNN-auto, which automatically selects the best inference mode at runtime based on latency and throughput requirements, guided by our accurate performance model. We evaluate the proposed hardware accelerator on five real-world datasets. ViTeGNN-bal reduces computation complexity by an average of 62% and memory accesses by an average of 36% with an accuracy loss of only 0.0042. Compared with state-of-the-art implementations on CPU and GPU, our FPGA implementation achieves $53.9/26.0/16.1\times$ speedup over the CPU and $8.2/4.0/2.5\times$ speedup over the GPU for ViTeGNN-lat/-bal/-thpt, respectively.
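
The abstract names a temporal neighbor pruning strategy driven by a lightweight attention score but does not detail its mechanics here. As an illustration only, the minimal Python sketch below shows one plausible form of such pruning: a cheap, recency-discounted similarity score ranks the sampled temporal neighbors of a node, and only the top-k are kept before aggregation, cutting both computation and memory traffic. All identifiers (`simple_score`, `prune_neighbors`, `k`, `w_time`) are hypothetical and are not taken from the paper; the concrete scoring function and pruning budget used by ViTeGNN are defined in the paper itself.

```python
import numpy as np

def simple_score(node_feat, nbr_feats, time_deltas, w_time=0.1):
    """Hypothetical lightweight relevance score: a dot product between the
    target node and each neighbor, discounted by the age of the interaction.
    This stands in for the paper's simplified attention-score computation."""
    sim = nbr_feats @ node_feat              # contextual similarity per neighbor
    recency = np.exp(-w_time * time_deltas)  # newer interactions score higher
    return sim * recency

def prune_neighbors(node_feat, nbr_feats, time_deltas, k=8):
    """Keep only the top-k temporal neighbors by the lightweight score,
    reducing the work done by the subsequent aggregation step."""
    scores = simple_score(node_feat, nbr_feats, time_deltas)
    keep = np.argsort(scores)[-k:]           # indices of the k highest scores
    return nbr_feats[keep], time_deltas[keep], scores[keep]

# Toy usage: 32 sampled temporal neighbors pruned down to 8.
rng = np.random.default_rng(0)
node = rng.standard_normal(16)
nbrs = rng.standard_normal((32, 16))
dts = rng.uniform(0, 100, size=32)
kept_feats, kept_dts, kept_scores = prune_neighbors(node, nbrs, dts, k=8)
print(kept_feats.shape)  # (8, 16)
```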