Bi-semantic Chemical Embedder for Joint Representation Learning of SMILES and Natural Language

Published: 30 May 2026, Last Modified: 30 May 2026, ICML2026-AI4Science Poster, CC BY 4.0
Track: Track 1: Original Research/Position/Education/Attention Track
Keywords: Representation Learning, Contrastive Learning, Language Models, Embeddings
Abstract: Transformer models have revolutionized NLP, and text-based molecular representations such as SMILES have extended these architectures to chemistry. However, domain-adaptive pretraining often causes models to overfit to chemical syntax and catastrophically forget their foundational semantic capabilities. To address this challenge, we introduce CheMatE, a chemistry-oriented embedding model that jointly captures molecular structure and domain-specific natural language in a single representation space. Initialized from a ModernBERT backbone, CheMatE learns bi-semantic representations through a two-stage training procedure: continued masked language modeling (MLM), followed by Matryoshka contrastive learning with Multiple Negatives Ranking Loss (MNRL). First, we train the model with MLM on a novel, large-scale corpus of SMILES-annotated, long-context scientific documents that we constructed and curated from FineWeb and ChemPile (10.4B and 11.5B tokens, respectively). Next, the model undergoes contrastive learning on a synthetic dataset of SMILES-text pairs algorithmically derived from the same training corpus. This design exposes the model to SMILES-enriched scientific literature, enabling bi-semantic understanding. We evaluate CheMatE on downstream tasks covering molecular property prediction and scientific language understanding. Our results indicate that coupling our custom-curated datasets with this sequential training strategy yields robust, transferable representations. By unifying structural and contextual signals within a single text-based framework, CheMatE achieves competitive performance against both specialized chemistry models and general-purpose encoder baselines.
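The second training stage named in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the nested dimensions `(64, 128, 256)`, the similarity scale of `20.0`, and the unweighted average over dimensions are all assumptions chosen for illustration. The sketch treats row `i` of the text embeddings as the positive for row `i` of the SMILES embeddings, uses every other row in the batch as a negative, and repeats the loss on truncated prefixes of the embedding, which is the Matryoshka idea.

```python
import numpy as np


def l2_normalize(x):
    # Scale each row to unit length so dot products are cosine similarities.
    # Assumes no row is all zeros.
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def mnrl(anchors, positives, scale=20.0):
    """In-batch-negatives ranking loss.

    Row i of `positives` is the match for row i of `anchors`;
    all other rows in the batch act as negatives.
    `scale` is an assumed temperature-like multiplier.
    """
    sims = (scale * l2_normalize(anchors)) @ l2_normalize(positives).T
    # Numerically stable log-softmax over each row.
    sims = sims - sims.max(axis=1, keepdims=True)
    log_probs = sims - np.log(np.exp(sims).sum(axis=1, keepdims=True))
    # Cross-entropy with the diagonal entry as the correct class.
    return float(-np.mean(np.diag(log_probs)))


def matryoshka_mnrl(anchors, positives, dims=(64, 128, 256), scale=20.0):
    """Average the ranking loss over nested prefixes of the embedding.

    `dims` is a hypothetical set of truncation sizes; equal weighting
    across sizes is also an assumption.
    """
    losses = [mnrl(anchors[:, :d], positives[:, :d], scale) for d in dims]
    return float(np.mean(losses))


# Toy usage: random vectors stand in for SMILES and text embeddings.
rng = np.random.default_rng(0)
smiles_emb = rng.normal(size=(8, 256))
text_emb = smiles_emb + 0.01 * rng.normal(size=(8, 256))  # near-aligned pairs

aligned_loss = matryoshka_mnrl(smiles_emb, text_emb)
# Rolling the rows by one breaks every SMILES-text pairing.
mismatched_loss = matryoshka_mnrl(smiles_emb, np.roll(text_emb, 1, axis=0))
```

In this toy setup the aligned pairs should score a much lower loss than the mismatched ones, because each truncated prefix is still required to rank the true pair above the in-batch negatives.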
Submission Number: 124