NovoMolGen: Rethinking Molecular Language Model Pretraining

Published: 11 Jun 2025, Last Modified: 18 Jul 2025
Venue: GenBio 2025 Poster
License: CC BY 4.0
Keywords: De Novo Molecular Generation, Molecular Language Model
TL;DR: We introduce NovoMolGen, a family of molecular foundation models for de novo molecule generation. It outperforms prior models in de novo and goal-directed generation tasks, providing valuable insights for Mol-LLMs and drug discovery.
Abstract: Designing de novo molecules with desired properties requires efficient exploration of an immense chemical space spanning $10^{23}$ to $10^{60}$ potential candidates. Although Molecular Large Language Models (Mol-LLMs) enable scalable exploration using string-based representations, the effects of language modeling practices such as tokenization, model size, and dataset scale on molecular generation performance remain unclear. In this study, we introduce NovoMolGen, a family of transformer-based foundation models pretrained on 1.5 billion molecules, to systematically investigate these key factors. Our analyses demonstrate a weak correlation between standard pretraining metrics and downstream molecular generation performance, highlighting critical differences compared to general NLP models. NovoMolGen achieves state-of-the-art results, outperforming prior Mol-LLMs and specialized generative models in both unconstrained and goal-directed molecule generation tasks.
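To make the weak-correlation finding concrete, below is a minimal sketch (not the authors' code; all checkpoint numbers are hypothetical placeholders) of how one might test whether a standard pretraining metric such as validation loss tracks a downstream generation metric across model checkpoints, using a rank correlation:

```python
# Minimal sketch: does a standard pretraining metric track downstream
# molecular generation quality? NOT the paper's code; the numbers below
# are hypothetical placeholders, not results from NovoMolGen.
from scipy.stats import spearmanr

# Hypothetical per-checkpoint pretraining validation losses (lower is better).
val_loss = [2.10, 1.85, 1.62, 1.48, 1.41, 1.37]

# Hypothetical downstream scores for the same checkpoints, e.g. the fraction
# of generated SMILES that are valid, unique, and novel (higher is better).
downstream = [0.62, 0.71, 0.70, 0.74, 0.72, 0.73]

# Rank correlation between the two series; a weak |rho| would mirror the
# paper's finding that standard pretraining metrics are poor proxies for
# molecular generation performance.
rho, p = spearmanr(val_loss, downstream)
print(f"Spearman rho = {rho:.2f} (p = {p:.2f})")
```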
Submission Number: 36