3DGrid-LLM: Token-Level Fusion of Language and 3D Grids for Chemical Multimodal Generation

Published: 20 Sept 2025 · Last Modified: 29 Oct 2025 · AI4Mat-NeurIPS-2025 Poster · CC BY 4.0
Keywords: early-fusion, multimodal foundation model, 3D density grids, textual information
Abstract: We introduce 3DGrid-LLM, an early-fusion multimodal foundation model that integrates natural language with 3D density grids for molecular and materials science. The model extends a large decoder-only language model with discrete volumetric tokens from a 3D VQGAN, enabling unified token-level processing of spatial and textual information. Trained on diverse datasets, 3DGrid-LLM supports bidirectional generation, multimodal question answering, and retrieval-augmented 3D grid generation. Experiments show consistent improvements over baselines in multimodal VQA, semantic text generation, and property-aligned retrieval, demonstrating accurate and physically consistent outputs. This work establishes a scalable framework for incorporating physically grounded volumetric data into language models.
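The abstract describes early fusion at the token level: a 3D VQGAN discretizes a density grid into codebook indices, which are offset past the text vocabulary so that spatial and textual tokens share one sequence for a decoder-only language model. A minimal sketch of that idea follows; all names, vocabulary sizes, patch sizes, and the random codebook are illustrative assumptions, not the paper's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

TEXT_VOCAB_SIZE = 32000   # assumed text vocabulary size (illustrative)
CODEBOOK_SIZE = 512       # assumed VQ codebook size (illustrative)
PATCH = 4                 # assumed cubic patch edge length in voxels

# Stand-in for a learned 3D VQGAN codebook: one vector per discrete code.
codebook = rng.normal(size=(CODEBOOK_SIZE, PATCH**3))

def quantize_grid(grid: np.ndarray) -> np.ndarray:
    """Map each PATCH^3 block of a cubic grid to its nearest codebook index."""
    n = grid.shape[0] // PATCH
    blocks = (grid[:n*PATCH, :n*PATCH, :n*PATCH]
              .reshape(n, PATCH, n, PATCH, n, PATCH)
              .transpose(0, 2, 4, 1, 3, 5)
              .reshape(-1, PATCH**3))
    # Nearest-neighbour lookup in the codebook (squared L2 distance).
    d = ((blocks[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return d.argmin(axis=1)

def fuse(text_ids: np.ndarray, grid: np.ndarray) -> np.ndarray:
    """Concatenate text tokens with grid tokens offset into a shared id space."""
    grid_ids = quantize_grid(grid) + TEXT_VOCAB_SIZE
    return np.concatenate([text_ids, grid_ids])

text_ids = np.array([101, 2054, 2003])  # placeholder tokenized prompt
grid = rng.normal(size=(8, 8, 8))       # toy 8^3 density grid
seq = fuse(text_ids, grid)
print(seq.shape)  # 3 text tokens + (8/4)^3 = 8 grid tokens -> (11,)
```

Because both modalities live in one id space, the same autoregressive decoder can attend across and generate either kind of token, which is what enables the bidirectional text-to-grid and grid-to-text generation the abstract claims.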
Submission Track: Paper Track (Full Paper)
Submission Category: AI-Guided Design
Institution Location: Rio de Janeiro, Brazil; San Jose, United States
AI4Mat Journal Track: Yes
Submission Number: 38