Keywords: conformation generation, language model, computational biology
Abstract: Generating accurate 3D conformations of small molecules from their 2D representations is a central task in computational drug discovery, impacting molecular docking, virtual screening, and property prediction. While most recent advances rely on diffusion-based generative models, these methods come with limitations such as slow sampling and architectural rigidity. In this work, we demonstrate that autoregressive language models can effectively learn to generate 3D molecular conformations from text-only data. We propose a simple yet expressive representation that combines canonical SMILES with raw 3D atomic coordinates in a unified tokenized format. Using this approach, we train language models ranging from 100 million to 1 billion parameters on a dataset curated from GEOM-Drugs. While our models currently perform slightly behind the best diffusion-based methods, they achieve competitive results and show consistent improvements with scale. We derive empirical scaling laws demonstrating that generation quality improves predictably with model size and data, suggesting a clear path toward closing the performance gap. These findings indicate that language models are a scalable and flexible alternative for 3D molecular generation, with potential for further improvement through recent advancements in large language models, such as in-context learning and post-training adaptation.
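The abstract describes the representation only at a high level (canonical SMILES paired with raw 3D atomic coordinates in one tokenized text string). Below is a minimal sketch of one possible serialization of that idea, assuming RDKit for canonicalization and an RDKit-embedded placeholder conformer; the `[SMILES]` / `[XYZ]` markers and the coordinate formatting are illustrative assumptions, not the paper's actual vocabulary, and in the described setup the geometries would come from GEOM-Drugs reference conformers rather than an RDKit embedding.

```python
# Sketch of a "canonical SMILES + 3D coordinates" text serialization.
# Assumptions: RDKit is available; [SMILES]/[XYZ] delimiters and the
# 3-decimal coordinate format are hypothetical, not the paper's scheme.
from rdkit import Chem
from rdkit.Chem import AllChem


def serialize_conformation(smiles: str, decimals: int = 3) -> str:
    """Return one text string pairing canonical SMILES with per-atom xyz coordinates."""
    mol = Chem.MolFromSmiles(smiles)
    canonical = Chem.MolToSmiles(mol, canonical=True)

    # Placeholder geometry: an RDKit ETKDG embedding stands in for a
    # reference conformer (e.g., one taken from GEOM-Drugs).
    mol3d = Chem.AddHs(mol)
    AllChem.EmbedMolecule(mol3d, randomSeed=0)
    conf = mol3d.GetConformer()

    coord_tokens = []
    for atom in mol3d.GetAtoms():
        pos = conf.GetAtomPosition(atom.GetIdx())
        coord_tokens.append(
            f"{atom.GetSymbol()} {pos.x:.{decimals}f} {pos.y:.{decimals}f} {pos.z:.{decimals}f}"
        )

    return f"[SMILES] {canonical} [XYZ] " + " ".join(coord_tokens)


# Example: serialize aspirin into a single sequence a language model could be trained on.
print(serialize_conformation("CC(=O)Oc1ccccc1C(=O)O"))
```

The point of such a flat text format is that an off-the-shelf autoregressive language model can consume and emit it without architectural changes, which is the flexibility the abstract contrasts with diffusion-based pipelines.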
Submission Number: 102