Keywords: discrete diffusion, conditional generation, molecular generation, drug discovery
TL;DR: ProtoBind-Diff is a masked diffusion model that generates molecules conditioned on protein sequence embeddings, enabling target-specific ligand discovery without requiring 3D structures for training.
Abstract: Designing small molecules that selectively bind to protein targets remains a central challenge in drug discovery. While recent generative models leverage 3D structural data to guide ligand generation, their applicability is limited by the sparsity and bias of experimentally determined complexes. Here, we introduce ProtoBind-Diff, a structure-free masked diffusion model that conditions molecular generation directly on protein sequences via pre-trained language model embeddings. Trained on over one million active protein-ligand pairs from BindingDB, ProtoBind-Diff generates chemically valid, novel, and target-specific ligands without requiring 3D structures for training or inference. In extensive benchmarking against 3D structure-based models, ProtoBind-Diff achieves competitive predicted binding affinity scores and performs well on challenging targets, including those with limited training data. Although the model is never trained on data containing binding pockets, its attention maps align with contact residues, suggesting that it learns spatially meaningful interaction priors from sequence alone. These results demonstrate that sequence-conditioned diffusion can enable structure-free, scalable ligand discovery across the proteome, including orphan or rapidly emerging targets.
Supplementary Material: zip
Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)
Submission Number: 19331