GRAIL: Graph Edit Distance and Node Alignment using LLM-Generated Code

Published: 01 May 2025, Last Modified: 18 Jun 2025 | ICML 2025 poster | CC BY 4.0
TL;DR: GRAIL leverages LLMs and automated prompt engineering, guided by an evolutionary algorithm, to generate programs for accurate, cross-domain generalizable, and interpretable Graph Edit Distance computation.
Abstract: Graph Edit Distance (GED) is a widely used metric for measuring similarity between two graphs. Computing the optimal GED is NP-hard, leading to the development of various neural and non-neural heuristics. While neural methods have achieved improved approximation quality compared to non-neural approaches, they face significant challenges: (1) They require large amounts of ground truth data, which is itself NP-hard to compute. (2) They operate as black boxes, offering limited interpretability. (3) They lack cross-domain generalization, necessitating expensive retraining for each new dataset. We address these limitations with GRAIL, introducing a paradigm shift in this domain. Instead of training a neural model to predict GED, GRAIL employs a novel combination of large language models (LLMs) and automated prompt tuning to generate a *program* that is used to compute GED. This shift from predicting GED to generating programs imparts various advantages, including end-to-end interpretability and an autonomous self-evolutionary learning mechanism without ground-truth supervision. Extensive experiments on seven datasets confirm that GRAIL not only surpasses state-of-the-art GED approximation methods in prediction quality but also achieves robust cross-domain generalization across diverse graph distributions.
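The connection between node alignment and GED can be illustrated with a small sketch. Any one-to-one mapping between the nodes of two graphs implies a concrete set of edits, so the cost of those edits is an upper bound on the true GED, and the minimum over all mappings is the GED itself. The code below assumes unlabeled, undirected graphs with unit edit costs. The function names `induced_cost` and `exact_ged` are hypothetical and are not taken from the GRAIL repository.

```python
from itertools import permutations


def induced_cost(n1, edges1, n2, edges2, mapping):
    """Edit cost implied by a node mapping (unit costs, unlabeled, undirected).

    Both graphs are padded with dummy nodes to n = max(n1, n2).
    mapping[i] is the graph-2 node matched to graph-1 node i.
    A real node matched to a dummy node is a node deletion or insertion.
    """
    n = max(n1, n2)
    # Node insertions/deletions: exactly one side of the match is real.
    cost = sum(1 for i in range(n) if (i < n1) != (mapping[i] < n2))
    # Carry graph-1 edges over to graph-2 node ids, then count mismatches.
    mapped = {frozenset((mapping[u], mapping[v])) for u, v in edges1}
    target = {frozenset(e) for e in edges2}
    cost += len(mapped ^ target)  # edge deletions plus edge insertions
    return cost


def exact_ged(n1, edges1, n2, edges2):
    """Brute-force GED: minimum induced cost over all padded mappings.

    Factorial time, so this is only usable for very small graphs.
    """
    n = max(n1, n2)
    return min(
        induced_cost(n1, edges1, n2, edges2, p) for p in permutations(range(n))
    )


# Example: a 4-node path versus a 4-node star centred on node 0.
path = [(0, 1), (1, 2), (2, 3)]
star = [(0, 1), (0, 2), (0, 3)]

naive = induced_cost(4, path, 4, star, [0, 1, 2, 3])   # identity mapping
better = induced_cost(4, path, 4, star, [1, 0, 2, 3])  # swap nodes 0 and 1
optimal = exact_ged(4, path, 4, star)
```

The brute-force search makes the NP-hardness tangible: the number of candidate mappings grows factorially with graph size, which is why heuristics that propose a single good mapping are attractive.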
Lay Summary: Imagine you have a collection of known drug molecules and want to find which one is most similar to a newly discovered compound. To do this, scientists represent molecules as network diagrams (called graphs) where atoms are dots and chemical bonds are lines connecting them. They then measure how different two molecules are by counting the minimum number of changes needed to transform one molecular structure into another; this is called "Graph Edit Distance." The problem is that calculating this distance accurately is extremely time-consuming. For small molecules with 30-50 atoms, it can take hours or days. For larger molecules, it could take years using current computers. This makes it impractical for real-world drug discovery where scientists need to compare thousands of molecules quickly. Traditional computer methods tried to solve this by creating scoring tables that assign costs to different types of structural changes (like adding or removing atoms and bonds), then finding the lowest-cost way to transform one molecule into another. However, these methods weren't very accurate. More recent approaches used artificial neural networks, which are much more accurate but have a major drawback: they need expensive "correct answer" data to learn from. Since getting these correct answers requires the same slow calculations mentioned earlier, training these systems becomes extremely expensive and time-consuming. Our solution, called GRAIL, takes a completely different approach. Instead of training a neural network to predict the distance directly, we use Large Language Models (like GPT or Gemini) to write small computer programs that create better scoring tables. We then automatically select the best combination of these programs to minimize errors across our molecules. The key breakthrough is that GRAIL doesn't need those expensive "correct answers" for training.
It can learn to make accurate predictions without requiring the time-consuming exact calculations that other methods depend on. This makes it both faster to develop and more practical for real-world use, while maintaining high accuracy comparable to neural network methods. This approach is particularly valuable for drug discovery, where researchers need to quickly identify promising molecular candidates from vast chemical databases. GRAIL can also be applied in areas like genetics, for comparing DNA structures and grouping similar viruses, as well as in computer science, for detecting similarities in source code to identify potential plagiarism.
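One way to see why program selection might not need ground truth is sketched below. It relies on a standard fact: the edits implied by any node alignment give an upper bound on the true distance, so among several candidate programs the one yielding the smaller bound is at least as close to the exact answer. Everything in the block is an illustrative assumption rather than the authors' implementation: the two hand-written scoring programs stand in for LLM-generated ones, the graphs are unlabeled with unit costs and equal node counts, and the alignment is found by brute force instead of an efficient assignment solver.

```python
from itertools import permutations


def induced_cost(edges1, edges2, mapping):
    """Unit-cost edit count implied by a node mapping (equal-size graphs)."""
    mapped = {frozenset((mapping[u], mapping[v])) for u, v in edges1}
    return len(mapped ^ {frozenset(e) for e in edges2})


def degrees(n, edges):
    deg = [0] * n
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return deg


# Two hypothetical "scoring programs": each returns an n x n table where a
# higher entry means "matching these two nodes looks cheaper".
def score_by_degree(n, edges1, edges2):
    d1, d2 = degrees(n, edges1), degrees(n, edges2)
    return [[-abs(a - b) for b in d2] for a in d1]


def score_by_index(n, edges1, edges2):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def upper_bound(program, n, edges1, edges2):
    """Align nodes using the program's table, then count the implied edits."""
    table = program(n, edges1, edges2)
    # Brute-force best assignment; fine for tiny graphs only.
    mapping = max(
        permutations(range(n)),
        key=lambda p: sum(table[i][p[i]] for i in range(n)),
    )
    return induced_cost(edges1, edges2, mapping)


def select_program(programs, pairs):
    """Pick the program with the smallest total bound; no exact GED needed."""
    return min(
        programs,
        key=lambda prog: sum(upper_bound(prog, n, e1, e2) for n, e1, e2 in pairs),
    )


path = [(0, 1), (1, 2), (2, 3)]
star = [(0, 1), (0, 2), (0, 3)]
pairs = [(4, path, star)]
programs = [score_by_index, score_by_degree]

bounds = {p.__name__: upper_bound(p, 4, path, star) for p in programs}
best = select_program(programs, pairs)
# A combined estimate takes the smallest bound across all programs.
combined = min(bounds.values())
```

Because every bound is valid, taking the minimum across programs can only tighten the estimate, which is the intuition behind combining several programs rather than relying on one.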
Link To Code: https://github.com/idea-iitd/Grail
Primary Area: General Machine Learning->Sequential, Network, and Time Series Modeling
Keywords: Graph Edit Distance, Large Language Model, Code Discovery, Cross-domain generalization, Interpretability
Submission Number: 7294