Abstract: Graph Neural Networks (GNNs), which learn representations of non-Euclidean data, are rapidly rising in popularity and are used in several computationally demanding scientific applications. As these deep learning models become more prevalent in practical applications, their inference performance becomes increasingly critical. GNNs have been shown to suffer from severe memory and computational bottlenecks on traditional hardware platforms (e.g., GPUs), due in part to their reliance on non-contiguous data structures. While the dataflow architectures used by emerging hardware accelerators offer a potential remedy, end-to-end GNN models are generally not yet supported by these platforms, so it is not currently possible to directly compare the performance of GNNs on traditional GPUs with that of these hardware accelerators. In this work, we analyze the performance of operators relevant to modern GNNs on three platforms: the NVIDIA A100 GPU, the Groq GroqChip1, and the SambaNova Reconfigurable Dataflow Unit (RDU). Specifically, we first profile several modern GNN models on traditional GPUs to determine the operators, fused kernels, and message-passing layers most relevant to these architectures. We then systematically benchmark and analyze performance at each of these levels of abstraction on each hardware platform. Our analysis shows that (1) due to their reliance on non-contiguous data, GNNs suffer from cache inefficiency on conventional GPUs, (2) dataflow architectures, owing in part to their cache-less design, implicitly optimize operators pertinent to GNNs, and (3) the RDU and GroqChip1 platforms achieve significant inference speedups over a traditional GPU on pertinent subsets of end-to-end GNN networks. Our open-source code is available at https://github.com/ryienh/gnn-ops-benchmark.