MolVision: Molecular Property Prediction with Vision Language Models

Deepan Adak; Yogesh S Rawat; Shruti Vyas

Published: 18 Sept 2025, Last Modified: 30 Oct 2025

Venue: NeurIPS 2025 Datasets and Benchmarks Track poster

License: CC BY-NC 4.0

Keywords: Vision Language Models, Molecular properties

TL;DR: Visual and textual data to enhance molecular property prediction

Abstract: Molecular property prediction is a fundamental task in computational chemistry with critical applications in drug discovery and materials science. While recent works have explored Large Language Models (LLMs) for this task, they primarily rely on textual molecular representations such as SMILES/SELFIES, which can be ambiguous and structurally uninformative. In this work, we introduce MolVision, a novel approach that leverages Vision-Language Models (VLMs) by integrating both molecular structure images and textual descriptions to enhance property prediction. We construct a benchmark spanning nine diverse datasets, covering both classification and regression tasks. Evaluating nine different VLMs in zero-shot, few-shot, and fine-tuned settings, we find that visual information improves prediction performance, particularly when combined with efficient fine-tuning strategies such as LoRA. Our results reveal that while visual information alone is insufficient, multimodal fusion significantly enhances generalization across molecular properties. Adapting the vision encoder to molecular images in conjunction with LoRA further improves performance. The code and data are available at https://molvision.github.io/MolVision/.
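The abstract names LoRA as the efficient fine-tuning strategy without describing it. As a rough illustration only, the sketch below shows the standard LoRA formulation, in which a frozen weight matrix W receives a trainable low-rank update scaled by alpha / r. It is not taken from the MolVision code; the function names (`matmul`, `lora_weight`, `trainable_params`) and the toy matrices are hypothetical and chosen purely for this example.

```python
# Minimal LoRA sketch in pure Python (illustrative only; not the MolVision code).
# Assumed shapes: W is d_out x d_in (frozen), B is d_out x r and A is r x d_in (trainable).

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_weight(W, A, B, alpha):
    """Return the effective weight W + (alpha / r) * (B @ A), where r = rank of the update."""
    r = len(A)                      # A has r rows
    delta = matmul(B, A)            # low-rank update, same shape as W
    scale = alpha / r
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]

def trainable_params(d_out, d_in, r):
    """Number of trainable parameters in the LoRA factors A and B."""
    return r * (d_in + d_out)

# Toy example: a 2x2 identity weight with a rank-1 update.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 2.0]]        # r x d_in  = 1 x 2
B = [[0.5], [0.0]]      # d_out x r = 2 x 1
W_eff = lora_weight(W, A, B, alpha=2.0)

# With B initialised to zero (the usual LoRA initialisation), the model is unchanged.
W_init = lora_weight(W, A, [[0.0], [0.0]], alpha=2.0)
```

Only A and B are updated during fine-tuning, so for a d_out x d_in layer the trainable parameter count falls from d_out * d_in to r * (d_in + d_out), which is the source of the efficiency the abstract refers to.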

Croissant File: zip

Dataset URL: https://huggingface.co/molvision

Code URL: https://molvision.github.io/MolVision/

Supplementary Material: pdf

Primary Area: Datasets & Benchmarks for applications in language modeling and vision language modeling

Submission Number: 1412
