Chemical Priors at Scale: Efficient Foundation Models without Big Corpora

ICLR 2026 Conference Submission 25048 Authors

20 Sept 2025 (modified: 08 Oct 2025) · License: CC BY 4.0
Keywords: Molecular language modeling, chemically-informed self-supervision, scientific foundation models
Abstract: We achieve competitive molecular property prediction using up to two orders of magnitude fewer pretraining molecules by replacing generic masked language modeling with chemically-informed, task-conditioned self-supervision. Our **C**hemically **I**nformed **L**anguage **T**ransformer (**CILT**) learns from 300+ programmatically derived chemical tasks (functional groups, substructure counts, molecular properties) paired with natural language descriptions. During pretraining, the model alternates between predicting masked SMILES tokens conditioned on task descriptions and predicting property values conditioned on molecules, creating a unified architecture for generation, regression, and classification driven by text prompts. This approach yields three key advantages. First, despite using orders of magnitude less data, we match state-of-the-art performance on MoleculeNet benchmarks. Second, the learned representations exhibit chemical interpretability: embeddings cluster by functional group without explicit supervision, while attention mechanisms route from task descriptions to chemically relevant atoms. Third, the model demonstrates predictable zero-shot generalization: adaptation speed correlates with semantic similarity between task descriptions, enabling rapid few-shot learning on unseen substructures. Our results demonstrate that structured domain knowledge, encoded through natural language, can substitute for scale in scientific foundation models, establishing a blueprint for data-efficient pretraining in chemistry and beyond.
Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)
Submission Number: 25048
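
The abstract describes pretraining that alternates between task-conditioned masked-SMILES prediction and molecule-conditioned property prediction. The sketch below illustrates one way such an alternating objective could look; it is not the authors' code. The model class, shared encoder over a concatenated [task description; SMILES] sequence, two-head design, 15% masking rate, and 50/50 objective sampling are all illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch (assumed PyTorch implementation) of an alternating,
# task-conditioned pretraining step: (a) masked-SMILES prediction conditioned
# on a task description, (b) property-value regression conditioned on the molecule.
# All names, sizes, and rates here are placeholders.
import random
import torch
import torch.nn as nn

VOCAB_SIZE, D_MODEL, MASK_ID = 512, 256, 1  # illustrative constants

class TaskConditionedLM(nn.Module):
    """Shared encoder over [task-description tokens ; SMILES tokens] with
    a token head (masked-SMILES logits) and a value head (scalar property)."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, D_MODEL)
        layer = nn.TransformerEncoderLayer(D_MODEL, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.token_head = nn.Linear(D_MODEL, VOCAB_SIZE)  # masked-token logits
        self.value_head = nn.Linear(D_MODEL, 1)           # property value

    def forward(self, task_ids, smiles_ids):
        x = self.embed(torch.cat([task_ids, smiles_ids], dim=1))
        h = self.encoder(x)
        smiles_h = h[:, task_ids.size(1):]                # hidden states of SMILES positions
        return self.token_head(smiles_h), self.value_head(h.mean(dim=1)).squeeze(-1)

def pretrain_step(model, opt, batch):
    """One step that randomly alternates between the two objectives."""
    task_ids, smiles_ids, target_value = batch            # token ids and a scalar target
    opt.zero_grad()
    if random.random() < 0.5:                             # (a) task-conditioned masked LM
        masked = smiles_ids.clone()
        mask = torch.rand(masked.shape) < 0.15            # mask ~15% of SMILES tokens
        masked[mask] = MASK_ID
        logits, _ = model(task_ids, masked)
        loss = nn.functional.cross_entropy(logits[mask], smiles_ids[mask])
    else:                                                  # (b) property regression
        _, pred = model(task_ids, smiles_ids)
        loss = nn.functional.mse_loss(pred, target_value)
    loss.backward()
    opt.step()
    return loss.item()
```

Under this reading, a single text-prompted encoder serves generation (token head) and regression/classification (value head), which is consistent with the abstract's claim of a unified architecture driven by task descriptions.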