TL;DR: We introduce a framework for assessing the quality of graph-learning datasets by measuring differences between the original dataset and its perturbed representations.
Abstract: Benchmark datasets have proved pivotal to the success of graph learning, and *good* benchmark datasets are crucial to guide the development of the field. Recent research has highlighted problems with graph-learning datasets and benchmarking practices—revealing, for example, that methods that ignore the graph structure can outperform graph-based approaches. Such findings raise two questions: (1) What makes a good graph-learning dataset, and (2) how can we evaluate dataset quality in graph learning? Our work addresses these questions. As the classic evaluation setup uses datasets to evaluate models, it does not apply to dataset evaluation. Hence, we start from first principles. Observing that graph-learning datasets uniquely combine two modes—graph structure and node features—we introduce RINGS, a flexible and extensible *mode-perturbation framework* to assess the quality of graph-learning datasets based on *dataset ablations*, i.e., by quantifying differences between the original dataset and its perturbed representations. Within this framework, we propose two measures—*performance separability* and *mode complementarity*—as evaluation tools, each assessing, from a distinct angle, the capacity of a graph dataset to benchmark the power and efficacy of graph-learning methods. We demonstrate the utility of our framework for dataset evaluation via extensive experiments on graph-level tasks and derive actionable recommendations for improving the evaluation of graph-learning methods. Our work opens new research directions in data-centric graph learning, and it constitutes a step toward the systematic *evaluation of evaluations*.
Lay Summary: To assess the quality of new machine-learning *models*, researchers typically evaluate the performance of these models on a number of standard datasets. But how do we know that the *datasets* used for evaluation are any good? Our work addresses this question, focusing on a setting in which the data points are *graphs* and the task is to predict some of their graph-level properties ("graph learning"). In this setting, two types of information can be exploited (the structure of a graph and the attributes of its nodes), and a good graph-learning dataset should require both of them to solve a given learning task. We introduce a framework for measuring the extent to which this desirable property holds, finding that many popular datasets fail to meet our new quality standard. Our work highlights problems with established evaluation practices in (graph) machine learning, and it provides tools to improve these practices. Thus, we contribute to the development of more reliable machine-learning models.
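To make the notion of a *dataset ablation* more concrete, the sketch below illustrates two simple mode perturbations (removing the feature mode and removing the structure mode). It is a hypothetical, minimal example assuming PyTorch Geometric data objects; it is not drawn from the RINGS codebase linked below.

```python
# Minimal conceptual sketch (not the RINGS API): two illustrative mode
# perturbations for a graph-learning dataset, assuming PyTorch Geometric
# `Data` objects with node features `x`, connectivity `edge_index`, and label `y`.
import torch
from torch_geometric.data import Data


def ablate_features(graph: Data) -> Data:
    """Perturb the feature mode: replace node features with uninformative constants."""
    x = torch.ones(graph.num_nodes, 1)
    return Data(x=x, edge_index=graph.edge_index, y=graph.y)


def ablate_structure(graph: Data) -> Data:
    """Perturb the structure mode: delete all edges, keep node features."""
    empty_edges = torch.empty((2, 0), dtype=torch.long)
    return Data(x=graph.x, edge_index=empty_edges, y=graph.y)


# A dataset ablation applies one such perturbation to every graph in the dataset.
# Comparing models (or representations) on the original vs. perturbed datasets
# then indicates how much each mode contributes to solving the task.
```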
Link To Code: https://github.com/aidos-lab/rings
Primary Area: Deep Learning->Graph Neural Networks
Keywords: data-centric graph learning, network data, dataset evaluation
Submission Number: 10280