SMI-TED: A large-scale foundation model for materials and chemistry

Published: 09 Apr 2025 · Last Modified: 10 Apr 2025 · AI4MAT-ICLR-2025 Poster · CC BY 4.0
Submission Track: Full Paper
Submission Category: AI-Guided Design + Automated Synthesis
Keywords: foundation model, mixture-of-experts, QM9, synthesis
Abstract: We present SMI-TED, a large-scale encoder-decoder foundation model for materials and chemistry, trained on 91 million SMILES samples from PubChem using self-supervised learning. The encoder-decoder architecture supports a wide range of complex tasks, including the prediction of quantum chemical properties and reaction yields. We provide two model variants, with 289M and $8 \times 289$M parameters, to accommodate different use cases. SMI-TED achieves state-of-the-art performance across multiple benchmark datasets. Latent space analyses reveal signs of compositionality and separability, key properties for higher-level reasoning and few-shot learning. In particular, SMI-TED captures chemically meaningful structure–property relationships without task-specific fine-tuning, as shown by the clustering of nitrogen-containing molecules with high HOMO energies. Compared to an encoder-only baseline, SMI-TED achieves a lower Davies–Bouldin index, highlighting the benefits of its reconstruction-based training objective. To support further research and applications, we publicly release the model weights and source code on HuggingFace and GitHub.
AI4Mat Journal Track: Yes
Submission Number: 15
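
The abstract states that the weights are released on HuggingFace but does not name the repository. The following is a minimal sketch of fetching such a release with the `huggingface_hub` library; the repo id is a placeholder, not the paper's actual repository, and `snapshot_download` is an assumed retrieval mechanism rather than one confirmed by the authors.

```python
from huggingface_hub import snapshot_download

# Placeholder repo id: substitute the actual SMI-TED repository
# named in the paper's HuggingFace release.
local_dir = snapshot_download(repo_id="example-org/smi-ted")

# local_dir now points to the downloaded checkpoint/config files,
# which can be loaded with the code from the accompanying GitHub repo.
print(local_dir)
```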
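
The Davies–Bouldin comparison mentioned in the abstract can be reproduced in outline with scikit-learn, whose `davies_bouldin_score` computes the index directly. The embeddings and grouping below are synthetic stand-ins, since the actual evaluation data and clustering protocol are not specified on this page.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score

rng = np.random.default_rng(0)

# Stand-in for latent embeddings; in practice these would be the
# encoder outputs of SMI-TED (and of the encoder-only baseline)
# for the same set of molecules.
embeddings = rng.normal(size=(1000, 64))

# Group the molecules, e.g. by a chemical attribute such as the
# presence of nitrogen; k-means labels are used here for illustration.
labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(embeddings)

# A lower Davies-Bouldin index indicates more compact,
# better-separated clusters, i.e. a more structured latent space.
print(davies_bouldin_score(embeddings, labels))
```

Running the same computation on both models' embeddings, with identical molecule labels, makes the resulting indices directly comparable.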