NovoMolGen: Rethinking Molecular Language Model Pretraining

Published: 11 Jun 2025, Last Modified: 18 Jul 2025
Venue: GenBio 2025 Poster
License: CC BY 4.0
Keywords: De Novo Molecular Generation, Molecular Language Model
TL;DR: We introduce NovoMolGen, a family of molecular foundation models for de novo molecule generation. It outperforms prior models in de novo and goal-directed generation tasks, providing valuable insights for Mol-LLMs and drug discovery.
Abstract: Designing de novo molecules with desired properties requires efficient exploration of an immense chemical space spanning $10^{23}$ to $10^{60}$ potential candidates. Although Molecular Large Language Models (Mol-LLMs) enable scalable exploration using string-based representations, the effects of language modeling practices such as tokenization, model size, and dataset scale on molecular generation performance remain unclear. In this study, we introduce NovoMolGen, a family of transformer-based foundation models pretrained on 1.5 billion molecules, to systematically investigate these key factors. Our analyses demonstrate a weak correlation between standard pretraining metrics and downstream molecular generation performance, highlighting critical differences compared to general NLP models. NovoMolGen achieves state-of-the-art results, outperforming prior Mol-LLMs and specialized generative models in both unconstrained and goal-directed molecule generation tasks.
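To make the weak-correlation finding concrete, below is a minimal sketch (not the authors' code; all checkpoint numbers are hypothetical placeholders) of how one might test whether a standard pretraining metric such as validation loss tracks a downstream generation metric across model checkpoints, using a rank correlation:

```python
# Minimal sketch: does a standard pretraining metric track downstream
# molecular generation quality? NOT the paper's code; the numbers below
# are hypothetical placeholders, not results from NovoMolGen.
from scipy.stats import spearmanr

# Hypothetical per-checkpoint pretraining validation losses (lower is better).
val_loss = [2.10, 1.85, 1.62, 1.48, 1.41, 1.37]

# Hypothetical downstream scores for the same checkpoints, e.g. the fraction
# of generated SMILES that are valid, unique, and novel (higher is better).
downstream = [0.62, 0.71, 0.70, 0.74, 0.72, 0.73]

# Rank correlation between the two series; a weak |rho| would mirror the
# paper's finding that standard pretraining metrics are poor proxies for
# molecular generation performance.
rho, p = spearmanr(val_loss, downstream)
print(f"Spearman rho = {rho:.2f} (p = {p:.2f})")
```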
Submission Number: 36