EmbedMol: An Open Billion-scale Molecular Embedding Dataset for Molecular Discovery

ICLR 2026 Conference Submission 21937 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: large-scale datasets, embedding, molecular discovery, open source
Abstract: Modern molecular libraries span billions of compounds, exposing a mismatch between dataset scale and the practicality of virtual high-throughput screening (vHTS). SMILES strings remain the dominant representation; while easy to store, they are difficult to consume at billion scale, because each search or training run must first translate SMILES into learned features, incurring prohibitive overhead. We introduce \emph{EmbedMol}, the first open billion-scale dataset of precomputed molecular embeddings, along with a scalable generation pipeline. \emph{EmbedMol} comprises 977M embeddings from GDB13 and 11.2B embeddings from GDB13+ZINC22, generated with a deep model pretrained on experimental binding assays. Our contribution is not a new encoder, but a benchmark/dataset resource that makes billion-scale embedding-based retrieval practical. We demonstrate that precomputed vectors act as a faithful, efficient proxy for expensive inference, yielding speedups of up to \textbf{37.3$\times$} versus classical fingerprints and \textbf{1.5$\times$} versus re-running the encoder, while maintaining strong retrieval quality across multiple targets. Beyond efficiency, \emph{EmbedMol} establishes a testbed for billion-scale evaluation of retrieval methods, scaling behavior, and cross-target generalization in molecular discovery. To support reproducibility and accessibility, we release not only the dataset and loaders but also a fully automated AWS-based pipeline, enabling researchers with varying levels of distributed-systems expertise to reproduce and extend \emph{EmbedMol}.
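The retrieval pattern described in the abstract, reusing precomputed embeddings as a stand-in for running the encoder over every candidate, can be illustrated with a minimal sketch. The file name, shard layout, and NumPy-based similarity search below are assumptions for illustration only; they are not the released EmbedMol loaders or storage format.

```python
# Minimal sketch of embedding-based retrieval over precomputed vectors.
# Shard path and array shapes are hypothetical; EmbedMol's actual loaders
# and on-disk format are defined by the released dataset, not assumed here.
import numpy as np

def top_k_by_cosine(query: np.ndarray, shard_path: str, k: int = 100):
    """Return indices and scores of the k most similar precomputed embeddings.

    query      : (d,) embedding of the query molecule (encoded once).
    shard_path : path to one shard of precomputed embeddings, shape (n, d).
    """
    # Memory-map the shard so a billion-scale matrix never has to fit in RAM.
    emb = np.load(shard_path, mmap_mode="r")  # (n, d), e.g. float32

    # Cosine similarity = dot product of L2-normalized vectors.
    q = query / np.linalg.norm(query)
    norms = np.linalg.norm(emb, axis=1)
    scores = (emb @ q) / np.maximum(norms, 1e-12)

    # Partial sort: only the top-k entries need full ordering.
    idx = np.argpartition(-scores, k)[:k]
    idx = idx[np.argsort(-scores[idx])]
    return idx, scores[idx]
```

Because the vectors are precomputed, the per-query cost reduces to a similarity scan (or an approximate-nearest-neighbor lookup) rather than repeated encoder inference, which is the source of the speedups reported in the abstract.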
Primary Area: datasets and benchmarks
Submission Number: 21937