TL;DR: We show that integrating the intrinsic hierarchical graph information is essential for molecule-language alignment.
Abstract: Recently, there has been a surge of interest in extending the success of large language models (LLMs) from text to molecules. Most existing approaches adopt a graph neural network to represent a molecule as a series of node tokens for molecule-language alignment, which, however, overlooks the inherent hierarchical structures in molecules. Notably, higher-order molecular structures contain rich semantics of functional groups, which encode crucial biochemical functionalities of the molecules. We show that neglecting the hierarchical information in tokenization leads to subpar molecule-language alignment and severe hallucination. To address this limitation, we propose HIerarchical GrapH Tokenization (HIGHT). HIGHT employs a hierarchical graph tokenizer that encodes atom-, motif-, and molecule-level information into informative tokens to improve the molecular perception of LLMs. HIGHT also adopts an instruction tuning dataset augmented with hierarchical graph information to further enhance molecule-language alignment. Extensive experiments on 14 real-world benchmarks verify the effectiveness of HIGHT: it reduces hallucination by 40% and yields significant improvements on various molecule-language downstream tasks. The project is available at https://higraphllm.github.io/.
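To make the three levels of tokenization concrete, here is a minimal sketch (not the authors' implementation) that extracts atom-, motif-, and molecule-level tokens from a SMILES string, assuming RDKit is available and using BRICS fragments as a stand-in motif vocabulary; in HIGHT itself these levels are encoded by a hierarchical graph tokenizer into informative tokens for the LLM rather than kept as plain text.

```python
# Minimal sketch of hierarchical molecule tokenization (illustrative only).
# Assumes RDKit is installed; BRICS fragments stand in for functional-group motifs.
from rdkit import Chem
from rdkit.Chem import BRICS


def hierarchical_tokens(smiles: str) -> dict:
    """Return atom-, motif-, and molecule-level tokens for one molecule."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Could not parse SMILES: {smiles}")

    # Level 1: atom (node) tokens, as used by standard GNN-based tokenizers.
    atom_tokens = [atom.GetSymbol() for atom in mol.GetAtoms()]

    # Level 2: motif tokens; BRICS decomposition is one common motif vocabulary.
    motif_tokens = sorted(BRICS.BRICSDecompose(mol))

    # Level 3: a single molecule-level token (canonical SMILES here).
    molecule_token = Chem.MolToSmiles(mol)

    return {"atoms": atom_tokens, "motifs": motif_tokens, "molecule": molecule_token}


# Example: aspirin
print(hierarchical_tokens("CC(=O)Oc1ccccc1C(=O)O"))
```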
Lay Summary: How can we teach Large Language Models (LLMs) like ChatGPT to understand graph-structured data like molecules? Molecules are complex, with structures that go beyond individual atoms — they have functional groups and motifs that define their behavior, such as how they interact in biological or chemical processes. Current LLMs that align molecular data with language often overlook these hierarchical structures, leading to hallucinations and misinterpretations, like falsely identifying functional groups.
Our paper presents HIGHT, a new technique that captures the hierarchical structure of molecules, from atoms to motifs to the entire molecule. HIGHT uses a hierarchical graph tokenizer to explicitly teach LLMs to recognize and understand the intrinsic hierarchy.
Our results show significant improvements in molecule-related tasks, like predicting chemical properties and generating accurate molecule descriptions. Notably, HIGHT reduces errors (or "hallucinations") in identifying functional groups by 40%. This work extends the perception of LLMs to graph-structured data, and lays a foundation for more reliable AI applications in drug discovery, material science, and chemistry.
Link To Code: https://higraphllm.github.io/
Primary Area: Deep Learning->Graph Neural Networks
Keywords: molecular-language alignment, large language models, hierarchical graph neural networks, tokenization, biomolecular studies, molecule
Submission Number: 3713