Unpacking Evaluation Pitfalls on Standard GNN Benchmarks

12 Sept 2025 (modified: 11 Feb 2026). Submitted to ICLR 2026. License: CC BY 4.0
Keywords: GNN, Evaluation Pitfalls, Imbalance, Heterophily, Node and Graph Classification, Evaluation criteria.
TL;DR: Unpacking evaluation pitfalls on standard GNN benchmarks, particularly on imbalanced datasets.
Abstract: Graph Neural Networks (GNNs) have achieved substantial progress in graph-structured learning, with recent innovations targeting heterophilic graphs and attention-based designs such as Graph Transformers. These models are typically evaluated on widely used standard benchmark datasets for node and graph classification. In this work, we identify a critical and often overlooked issue: these benchmarks frequently suffer from significant class imbalance. Despite this, the GNN community predominantly relies on single aggregate metrics on these datasets, namely standard accuracy and AUROC, often overlooking their limitations. While convenient, these aggregate measures can obscure class-level disparities and lead to incorrect conclusions about architectural effectiveness. Our work provides empirical evidence of this limitation and advocates for a more robust evaluation framework that incorporates a diverse set of metrics (including balanced accuracy, AUPRC, and per-class metrics) to enable a transparent and reliable assessment of GNN capabilities.
Primary Area: datasets and benchmarks
Submission Number: 4399
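
The abstract's concern that an aggregate metric can hide class-level disparities can be illustrated with a small sketch. This is not the authors' evaluation code: the toy labels (95 nodes of class 0, 5 of class 1), the majority-class predictor, and the helper names `accuracy`, `per_class_recall`, and `balanced_accuracy` are all assumptions made here for illustration, written in plain Python.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_recall(y_true, y_pred):
    """Recall for each class that appears in y_true."""
    support = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {c: hits[c] / support[c] for c in sorted(support)}


def balanced_accuracy(y_true, y_pred):
    """Unweighted mean of the per-class recalls."""
    recalls = per_class_recall(y_true, y_pred)
    return sum(recalls.values()) / len(recalls)


# Hypothetical imbalanced node labels: 95% class 0, 5% class 1.
y_true = [0] * 95 + [1] * 5
# Hypothetical degenerate model that always predicts the majority class.
y_pred = [0] * 100

acc = accuracy(y_true, y_pred)
bal = balanced_accuracy(y_true, y_pred)
recalls = per_class_recall(y_true, y_pred)

print(f"accuracy          = {acc:.2f}")
print(f"balanced accuracy = {bal:.2f}")
print(f"per-class recall  = {recalls}")
```

Under these assumptions, standard accuracy is 0.95 even though the minority class is never predicted, while balanced accuracy is 0.50 and the per-class recall for class 1 is 0.0. A single aggregate number would rank this predictor as strong; the class-level view shows it has learned nothing about the minority class.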