KnowMol: Advancing Molecular Large Language Models with Multi-Level Chemical Knowledge

Published: 18 Sept 2025, Last Modified: 30 Oct 2025NeurIPS 2025 Datasets and Benchmarks Track posterEveryoneRevisionsBibTeXCC BY 4.0
Keywords: molecular large language model, dataset, molecular representation learning
Abstract: The molecular large language models have garnered widespread attention due to their promising potential on molecular applications. However, current molecular large language models face significant limitations in understanding molecules due to inadequate textual descriptions and suboptimal molecular representation strategies during pretraining. To address these challenges, we introduce KnowMol-100K, a large-scale dataset with 100K fine-grained molecular annotations across multiple levels, bridging the gap between molecules and textual descriptions. Additionally, we propose chemically-informative molecular representation, effectively addressing limitations in existing molecular representation strategies. Building upon these innovations, we develop KnowMol, a state-of-the-art multi-modal molecular large language model. Extensive experiments demonstrate that KnowMol achieves superior performance across molecular understanding and generation tasks.
Croissant File: json
Dataset URL: https://huggingface.co/datasets/yzf1102/KnowMol-100K
Code URL: https://github.com/yzf-code/KnowMol
Primary Area: AL/ML Datasets & Benchmarks for life sciences (e.g. climate, health, life sciences, physics, social sciences)
Submission Number: 1745
Loading