Keywords: Multimodal Modeling, Graph–LLM Alignment, Molecule Understanding, Backbone-Free Tuning
TL;DR: EDT-Former uses entropy-guided dynamic query tokens to map molecular graphs into an LLM's input space, capturing both local and global structural features for comprehensive understanding and reasoning with backbone-free, connector-only training.
Abstract: Molecular understanding is central to advancing scientific domains such as drug discovery, yet large language models (LLMs) struggle to interpret molecular graphs effectively. Existing graph–LLM bridges often adapt a Q-Former–style connector with fixed-length static tokens originally designed for vision tasks. These designs overlook stereochemistry and substructural context and typically require costly LLM-backbone fine-tuning, limiting efficiency and generalization. We introduce EDT-Former, an Entropy-guided Dynamic Token Transformer that generates tokens aligned with informative molecular patches, preserving both local and global structural features for molecular graph understanding. Unlike prior approaches, EDT-Former aligns frozen graph encoders with LLMs without tuning the LLM backbone, making fine-tuning computationally efficient. It achieves state-of-the-art results on the MoleculeQA and Mol-Instructions benchmarks, underscoring its effectiveness for scalable and generalizable multimodal molecular understanding.
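Since the connector-only setup is only described at a high level here, a minimal PyTorch sketch of what such a design could look like follows. The class name EDTConnector, the feature-entropy selection heuristic, the fixed token budget, and all dimensions are assumptions made for illustration, not the paper's actual architecture or training recipe.

```python
import torch
import torch.nn as nn


class EDTConnector(nn.Module):
    """Hypothetical connector sketch: scores graph patches by feature entropy,
    keeps an illustrative top-k subset, and cross-attends learnable queries
    to the kept patches before projecting into the LLM embedding space."""

    def __init__(self, graph_dim=300, llm_dim=2048, num_queries=32, num_heads=4):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, graph_dim))
        self.cross_attn = nn.MultiheadAttention(graph_dim, num_heads, batch_first=True)
        self.proj = nn.Linear(graph_dim, llm_dim)

    def forward(self, patch_emb):
        # patch_emb: (B, N, graph_dim) patch embeddings from a frozen graph encoder.
        B, N, _ = patch_emb.shape
        # Per-patch entropy over a softmax of its own features: an illustrative
        # stand-in for the paper's entropy-guided selection criterion.
        p = patch_emb.softmax(dim=-1)
        entropy = -(p * p.clamp_min(1e-9).log()).sum(dim=-1)  # (B, N)
        k = max(1, N // 2)  # assumed fixed budget; the paper's selection is dynamic
        keep_idx = entropy.topk(k, dim=-1).indices
        keep = torch.zeros(B, N, dtype=torch.bool, device=patch_emb.device)
        keep.scatter_(1, keep_idx, True)
        # Cross-attend learnable queries to the selected patches only.
        q = self.queries.unsqueeze(0).expand(B, -1, -1)
        out, _ = self.cross_attn(q, patch_emb, patch_emb, key_padding_mask=~keep)
        return self.proj(out)  # (B, num_queries, llm_dim): soft tokens for the LLM


if __name__ == "__main__":
    connector = EDTConnector()
    patches = torch.randn(2, 50, 300)  # stand-in for frozen-GNN patch embeddings
    tokens = connector(patches)
    print(tokens.shape)  # torch.Size([2, 32, 2048])
    # Connector-only training: both backbones stay frozen (requires_grad_(False));
    # only the connector's parameters go to the optimizer.
    optimizer = torch.optim.AdamW(connector.parameters(), lr=1e-4)
```

The key property this sketch preserves from the abstract is that gradients flow only through the connector, so neither the graph encoder nor the LLM backbone is updated during fine-tuning.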
Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)
Submission Number: 15761