Position: Graph Learning Will Lose Relevance Due To Poor Benchmarks

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 Position Paper Track (poster) · License: CC BY 4.0
TL;DR: We argue that graph learning needs a significant revision of benchmarks and benchmarking culture to stay relevant.
Abstract: While machine learning on graphs has demonstrated promise in drug design and molecular property prediction, significant benchmarking challenges hinder its further progress and relevance. Current benchmarking practices often lack focus on transformative, real-world applications, favoring narrow domains like two-dimensional molecular graphs over broader, impactful areas such as combinatorial optimization, databases, or chip design. Additionally, many benchmark datasets poorly represent the underlying data, leading to inadequate abstractions and misaligned use cases. Fragmented evaluations and an excessive focus on accuracy further exacerbate these issues, incentivizing overfitting rather than fostering generalizable insights. These limitations have prevented the development of truly useful graph foundation models. This position paper calls for a paradigm shift toward more meaningful benchmarks, rigorous evaluation protocols, and stronger collaboration with domain experts to drive impactful and reliable advances in graph learning research and unlock its full potential.
Lay Summary: Graph learning is a field of machine learning concerned with processing objects that describe relationships between entities, mathematically known as graphs. For example, a "molecular graph" represents a molecule by describing the relationships between its atoms, i.e., their chemical bonds. An example of a graph learning task is predicting chemically relevant properties of molecular graphs. The effectiveness of machine learning methods is typically measured on benchmarks: sets of controlled experiments on predefined data that allow researchers to track relevant research progress. An ideal benchmark originates from, and is closely connected to, a real-world problem: strong benchmark results should indicate that research methods can be successfully applied to real-world use cases. In this position paper we claim, however, that the progress of graph learning research is currently hindered by poor benchmarks, along with an unhealthy benchmarking culture. We argue that the most popular benchmarks are overly focused on prediction tasks over molecular graphs, while benchmarks for other promising applications are lacking. Additionally, we identify benchmarks that poorly model the relationships in the underlying data or fail to connect to relevant real-world use cases. We also underscore how newly proposed methods are often compared against baseline approaches that are not tuned to perform at their full potential. We support many of these claims with experimental evidence and provide multiple examples of real-world applications for which new graph learning benchmarks would be valuable, including the design of computer chips, weather forecasting, applications to relational databases, and combinatorial optimization problems with impact in, e.g., logistics.
Link To Code: https://github.com/benfinkelshtein/PP-Benchmarks
Primary Area: Research Priorities, Methodology, and Evaluation
Keywords: MPNNs, GNNs, graphs, benchmarking
Submission Number: 5