Keywords: Chemical Language Models (CLMs), Natural Products (NPs), State Space Models, Mamba, Mamba-2, GPT, molecule generation, tokenization, property prediction
TL;DR: We apply the latest state space models to chemical language models for Natural Products, comparing Mamba, Mamba-2, and GPT for molecule generation and property prediction, discussing the impact of architecture, tokenization, and training strategies.
Abstract: Language models are increasingly applied in scientific domains such as chemistry, where chemical language models (CLMs) are well established for predicting molecular properties and generating de novo compounds for small molecules. However, Natural Products (NPs)---such as penicillin, morphine, and quinine, which have driven major breakthroughs in medicine---have received limited attention in CLM research. This gap limits the potential of NPs as a source of new therapeutics. To bridge this gap, we develop Natural Product–specific CLMs (NPCLMs) by pre-training the latest state-space model variants, Mamba and Mamba-2, which have shown great potential in modeling information-dense sequences, and compare them with transformer baselines (GPT). Using the largest known collection of $\sim$1M NPs, we provide the first extensive experimental comparison of selective state-space models (S6) and transformers on NP-focused tasks, along with a comparison of eight tokenization strategies, including character-level, Atom-in-SMILES (AIS), general byte-pair encoding (BPE), and NP-specific byte-pair encoding (NPBPE). Model performance is evaluated on two tasks: molecule generation, measured by validity, uniqueness, and novelty, and property prediction (peptide membrane permeability, taste, and anti-cancer activity), evaluated using the Matthews Correlation Coefficient (MCC) and the Area Under the Receiver Operating Characteristic Curve (AUC-ROC). The results show that Mamba consistently generates 1–2\% more valid and unique molecules than Mamba-2 and GPT, while making 3–6\% fewer long-range dependency errors; however, GPT produces $\sim$2\% more novel structures. In property prediction, both Mamba and Mamba-2 outperform GPT by a modest but consistent 0.02–0.04 improvement in MCC under random splitting. Under stricter scaffold splitting, which groups molecules by core structure to better assess generalization to new scaffolds, all models perform comparably. In addition, chemically informed tokenization further enhances performance. For comparison, we include general-domain CLMs (ChemBERTa-2 and MoLFormer) and find that pre-training on $\sim$1M NPs achieves results on par with general CLMs trained on datasets over 100 times larger, emphasizing the value of domain-specific pre-training and data quality over scale in chemical language modeling.
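As an illustration of the evaluation metrics and the scaffold-based grouping named in the abstract, the sketch below shows commonly used definitions of validity, uniqueness, and novelty for generated SMILES, plus a Bemis–Murcko scaffold key, computed with RDKit. This is not the authors' code; the function names, the `generated`/`training_set` inputs, and the exact metric definitions are assumptions for the sake of the example.

```python
# Illustrative sketch only (assumed metric definitions, not the paper's implementation).
from rdkit import Chem
from rdkit.Chem.Scaffolds import MurckoScaffold


def generation_metrics(generated, training_set):
    """Compute validity, uniqueness, and novelty for a list of generated SMILES."""
    # Validity: fraction of generated strings RDKit can parse into a molecule.
    canonical = []
    for smi in generated:
        mol = Chem.MolFromSmiles(smi)
        if mol is not None:
            canonical.append(Chem.MolToSmiles(mol))  # canonical form for comparison
    validity = len(canonical) / len(generated) if generated else 0.0

    # Uniqueness: fraction of valid molecules that are structurally distinct.
    unique = set(canonical)
    uniqueness = len(unique) / len(canonical) if canonical else 0.0

    # Novelty: fraction of unique molecules not present in the training set.
    train_canonical = set()
    for smi in training_set:
        mol = Chem.MolFromSmiles(smi)
        if mol is not None:
            train_canonical.add(Chem.MolToSmiles(mol))
    novelty = len(unique - train_canonical) / len(unique) if unique else 0.0

    return validity, uniqueness, novelty


def scaffold_key(smiles):
    """Bemis-Murcko scaffold used to group molecules by core structure
    (the grouping underlying a scaffold split)."""
    return MurckoScaffold.MurckoScaffoldSmiles(smiles=smiles, includeChirality=False)
```

In a scaffold split, molecules sharing the same `scaffold_key` are assigned to the same partition, so the test set contains core structures unseen during training.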
Supplementary Material: zip
Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)
Submission Number: 16342