Submission Track: Full Paper
Submission Category: AI-Guided Design + Automated Synthesis
Keywords: foundation model, mixture-of-experts, QM9, synthesis
Abstract: We present SMI-TED, a large-scale encoder-decoder foundation model for materials and chemistry, trained on 91 million SMILES samples from PubChem using self-supervised learning. Our encoder-decoder architecture supports a wide range of complex tasks, including the prediction of quantum chemical properties and reaction yields. We provide two model variants, with 289M and 8×289M parameters, to accommodate different use cases. SMI-TED achieves state-of-the-art performance across multiple benchmark datasets. Latent space analyses reveal signs of compositionality and separability, key properties for higher-level reasoning and few-shot learning.
In particular, SMI-TED demonstrates its ability to capture chemically meaningful structure–property relationships without task-specific fine-tuning, as shown by the clustering of nitrogen-containing molecules with high HOMO energies. Compared to an encoder-only baseline, SMI-TED achieves a lower Davies–Bouldin index, highlighting the benefits of its reconstruction-based training objective. To support further research and applications, we publicly release the model weights and source code on Hugging Face and GitHub.
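As a rough illustration of the separability claim above, the sketch below computes the Davies–Bouldin index over a set of latent embeddings with scikit-learn. The random embeddings and k-means labels are hypothetical stand-ins, not the paper's actual pipeline; in practice one would use SMI-TED encoder outputs and chemically defined groups.

```python
# Minimal sketch: scoring latent-space separability with the Davies-Bouldin index.
# `embeddings` is a placeholder (n_samples, d) array standing in for SMI-TED
# encoder outputs; `labels` here come from k-means, standing in for chemical
# groupings (e.g., nitrogen-containing vs. not).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1000, 768))  # placeholder latent vectors

# Cluster the latent space, then score it: a lower Davies-Bouldin index
# means tighter, better-separated clusters.
labels = KMeans(n_clusters=8, n_init="auto", random_state=0).fit_predict(embeddings)
print(f"Davies-Bouldin index: {davies_bouldin_score(embeddings, labels):.3f}")
```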
AI4Mat Journal Track: Yes
Submission Number: 15