Chemical Priors at Scale: Efficient Foundation Models without Big Corpora

20 Sept 2025 (modified: 08 Jan 2026) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Molecular language modeling, chemically-informed self-supervision, scientific foundation models
Abstract: Deep learning models have proven to be a powerful tool for molecular property prediction, yet they remain underutilized in real-world applications. While powerful, these models lack the chemical interpretability needed to link predicted properties to the molecular motifs that govern them. To address this, we introduce the $\textbf{C}$hemically $\textbf{I}$nformed $\textbf{L}$anguage $\textbf{T}$ransformer ($\textbf{CILT}$), which uses hundreds of programmatically derived molecular motifs as a weak supervision prior. CILT leverages these motifs together with property descriptions to produce a chemically interpretable embedding space that clusters with respect to chemical motifs. This unified design enables CILT to adapt quickly to new motifs and properties and to perform classification, regression, and conditional generation. CILT achieves competitive performance while improving interpretability and requiring 2-3 orders of magnitude fewer molecules.
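To make the weak-supervision idea concrete, the sketch below shows one common way motif labels can be derived programmatically from SMILES strings: substructure matching with RDKit SMARTS patterns. This is an illustrative assumption, not the paper's actual motif library or pipeline, and the small `MOTIF_SMARTS` dictionary is hypothetical (the abstract states CILT uses hundreds of motifs).

```python
# Illustrative sketch only (not CILT's implementation): deriving binary motif
# labels from SMILES via RDKit SMARTS substructure matching, the kind of
# programmatic weak-supervision signal the abstract describes.
from rdkit import Chem

# Hypothetical, tiny motif library for illustration; a real system would use
# hundreds of such patterns.
MOTIF_SMARTS = {
    "carboxylic_acid": "C(=O)[OX2H1]",
    "primary_amine": "[NX3;H2][#6]",
    "aromatic_ring": "a1aaaaa1",
    "halogen": "[F,Cl,Br,I]",
}


def motif_labels(smiles: str) -> dict:
    """Return a binary motif-presence vector for one molecule."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Could not parse SMILES: {smiles!r}")
    return {
        name: int(mol.HasSubstructMatch(Chem.MolFromSmarts(pattern)))
        for name, pattern in MOTIF_SMARTS.items()
    }


if __name__ == "__main__":
    # Aspirin: contains a carboxylic acid and an aromatic ring, but no amine.
    print(motif_labels("CC(=O)Oc1ccccc1C(=O)O"))
```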
Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)
Submission Number: 25048