Energy-Guided Diffusion for Valid SMILES Generation

Ivan Gurev; Nikolay Nikitin

Published: 02 Mar 2026, Last Modified: 03 Apr 2026 | Venue: ReALM-GEN 2026 - ICLR 2026 Workshop | Readers: Everyone | License: CC BY 4.0

Keywords: Diffusion Models, Molecular Generation, SMILES, Chemical Validity, Inference-Time Alignment

TL;DR: We fix chemical validity in SMILES diffusion models by steering their denoising process toward valid molecular structures during inference, without retraining.

Abstract: Diffusion models provide a flexible framework for molecular generation, yet their application to SMILES sequences is fundamentally constrained by chemical validity. In continuous diffusion over token embeddings, denoising trajectories often drift off the discrete manifold of valid SMILES, producing syntactic errors, chemical violations, and corrupted stereochemistry after decoding. We analyze these failure modes and show that they stem from a misalignment between smooth probability paths in embedding space and the rule-governed structure of symbolic molecular representations. We frame SMILES validity as an inference-time alignment problem and interpret valid generation as sampling from a tilted distribution that reweights a base diffusion model toward structurally valid regions. Based on this perspective, we introduce validity-aware diffusion mechanisms that combine auxiliary training objectives with energy-based guidance during sampling, steering diffusion trajectories toward the valid SMILES manifold without changing the underlying representation or retraining the base model. Experiments demonstrate substantial improvements in SMILES validity while preserving diversity and novelty, showing that inference-time aligned diffusion can be competitive with autoregressive and masked language models for molecular string generation and suggesting broader applicability to structured symbolic domains such as code and discrete diffusion language models.
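The "tilted distribution" idea in the abstract can be illustrated on a toy discrete example. The sketch below is not the authors' method, which operates on continuous token embeddings during denoising. It only shows the reweighting p_tilted(x) ∝ p_base(x) · exp(-λ · E(x)) over a handful of candidate strings. The energy function `toy_energy`, the strength `lam`, and the candidate set `base` are all assumptions made for illustration; the energy checks only balanced parentheses and paired single-digit ring closures, which is far weaker than real SMILES validity.

```python
import math


def toy_energy(smiles: str) -> float:
    """Hypothetical energy: 0.0 if two cheap syntax checks pass, else 1.0."""
    depth = 0
    open_rings = set()
    for ch in smiles:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:  # closing a branch that was never opened
                return 1.0
        elif ch.isdigit():
            # A ring-closure digit opens a ring on first use, closes it on the next.
            open_rings ^= {ch}
    return 0.0 if depth == 0 and not open_rings else 1.0


def tilt(base_probs: dict, lam: float) -> dict:
    """Reweight a base distribution by exp(-lam * energy) and renormalize."""
    weights = {s: p * math.exp(-lam * toy_energy(s)) for s, p in base_probs.items()}
    z = sum(weights.values())
    return {s: w / z for s, w in weights.items()}


# Assumed base model output: two well-formed and two malformed strings.
base = {"CCO": 0.25, "c1ccccc1": 0.25, "CC(C": 0.25, "C1CC": 0.25}
tilted = tilt(base, lam=3.0)

# Probability mass on strings that pass the toy checks, before and after tilting.
base_valid_mass = sum(p for s, p in base.items() if toy_energy(s) == 0.0)
tilted_valid_mass = sum(p for s, p in tilted.items() if toy_energy(s) == 0.0)
print(f"valid mass: base={base_valid_mass:.3f}, tilted={tilted_valid_mass:.3f}")
```

Setting `lam=0.0` recovers the base distribution, and larger values push more mass onto low-energy strings without altering the base probabilities themselves, which is the sense in which such guidance needs no retraining of the base model.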

Email Sharing: We authorize the sharing of all author emails with Program Chairs.

Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.

Submission Number: 41
