Position: Graph Matching Systems Deserve Better Benchmarks

Published: 01 May 2025, Last Modified: 23 Jul 2025
Venue: ICML 2025 Position Paper Track (poster)
License: CC BY 4.0
Abstract: Data sets used in recent work on graph similarity scoring and matching tasks suffer from significant limitations. Using Graph Edit Distance (GED) as a showcase, we highlight pervasive issues such as train-test leakage and poor generalization, which have misguided the community's understanding and assessment of the capabilities of a method or model. These limitations arise, in part, because preparing labeled data is computationally expensive for combinatorial graph problems. We establish some key properties of GED that enable scalable data augmentation for training and adversarial test set generation. Together, our analysis, experiments, and insights establish new, sound guidelines for designing and evaluating future neural networks, and suggest open challenges for future research.
Lay Summary: We look at how current machine learning systems compare graphs, which is important in tasks like finding similar molecules or detecting fraud. We find that many widely-used datasets contain hidden overlaps between training and test data, sometimes nearing 100%, making it unclear whether models are truly learning meaningful patterns or just memorizing. We propose new ways to fix this and ensure fairer, more meaningful evaluation.
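To make the notion of train-test overlap concrete, here is a minimal sketch of one way such an overlap could be measured. It assumes that "overlap" means a test graph being isomorphic to some training graph, and that graphs are small, undirected, simple, and unlabeled; neither assumption is confirmed by the text above. The names `canonical_form` and `leakage_rate` are hypothetical and not taken from the authors' code. The brute-force canonical form is only practical for very small graphs.

```python
from itertools import permutations


def canonical_form(num_nodes, edges):
    """Return a relabeling-invariant key for a small undirected simple graph.

    Brute force: try every node permutation and keep the lexicographically
    smallest sorted edge tuple. Cost grows as num_nodes!, so this is only an
    illustration for tiny graphs (assumption: nodes are 0..num_nodes-1).
    """
    best = None
    for perm in permutations(range(num_nodes)):
        relabeled = tuple(
            sorted(tuple(sorted((perm[u], perm[v]))) for u, v in edges)
        )
        if best is None or relabeled < best:
            best = relabeled
    return (num_nodes, best)


def leakage_rate(train_graphs, test_graphs):
    """Fraction of test graphs isomorphic to at least one training graph.

    Each graph is a (num_nodes, edge_list) pair. Two graphs are isomorphic
    exactly when their canonical forms are equal.
    """
    if not test_graphs:
        return 0.0
    train_forms = {canonical_form(n, e) for n, e in train_graphs}
    leaked = sum(canonical_form(n, e) in train_forms for n, e in test_graphs)
    return leaked / len(test_graphs)


# Usage: one training path graph; the test set holds a relabeled copy of
# that path plus a triangle, so half of the test set overlaps with training.
train = [(3, [(0, 1), (1, 2)])]
test = [(3, [(0, 2), (2, 1)]), (3, [(0, 1), (1, 2), (0, 2)])]
rate = leakage_rate(train, test)
```

For realistic data set sizes one would swap the brute-force key for a proper isomorphism test or a graph hash; the point here is only that a relabeled copy of a training graph counts as leaked even though its edge list looks different.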
Link To Code: https://anonymous.4open.science/r/better-graph-matching-7146/README.md
Primary Area: Data Set Creation, Curation, and Documentation
Keywords: Improved benchmarking of GED tasks
Submission Number: 457