Improving Molecular Understanding of Large Language Models via Substructure-aware Instruction Tuning

ACL ARR 2026 January Submission 4416 Authors

05 Jan 2026 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · CC BY 4.0
Keywords: NLP Applications, AI4Science, Molecule Understanding
Abstract: Large Language Models (LLMs) have shown strong performance in molecular tasks, yet they often fail to capture fine-grained molecular information, particularly the presence of substructures and how they behave across diverse chemical contexts. Most existing approaches rely on surface-level cues, treating substructures as isolated markers rather than modeling their functional behaviors. We introduce SubMol-Instructions, a substructure-aware instruction tuning dataset that explicitly links molecular substructures to their functional behaviors across reaction prediction, property prediction, and molecule translation tasks. Building on this dataset, we propose StructMol, a molecule LLM that learns substructure behaviors using multiple 1D molecular representations. Experimental results on diverse chemical tasks show that our approach consistently outperforms state-of-the-art baselines, highlighting the importance of explicitly defining and learning substructural behaviors for improving fine-grained molecular understanding.
Paper Type: Long
Research Area: NLP Applications
Research Area Keywords: NLP Applications, data augmentation
Contribution Types: NLP engineering experiment
Languages Studied: English
Submission Number: 4416