Keywords: Protein, Protein Design, Binder, Ligand, Language models
TL;DR: Development, benchmarking, and tools for ligand-binding protein design with sequence-to-sequence protein language models
Abstract: Proteins can bind small molecules with high specificity, yet designing binders for user-defined ligands remains challenging and typically relies on structural information and costly experimental iteration. While protein language models (pLMs) have shown promise for unconditional generation and coarse label–controlled design, instance-level conditioning on a specific target ligand has not been systematically evaluated using purely textual inputs. Here we frame small-molecule protein binder design as a sequence-to-sequence translation problem and train ligand-conditioned pLMs that map molecular strings to candidate binder sequences. We curate large-scale ligand–protein datasets spanning distinct data regimes (>17M ligand–protein pairs) and train a suite of models ranging from 16 to 700M parameters. Results reveal a consistent trade-off driven by supervision ambiguity: when each ligand is paired with few proteins, models generate near-neighbour, foldable sequences; when each ligand is paired with many proteins, generations are more diverse but less consistently foldable. Our study shows how annotation diversity and sampling choices elicit this behaviour, and how it changes when we modify the data distribution. These insights highlight dataset redundancy and incompleteness as key bottlenecks for sequence-only binder design. We release the curated datasets, trained models, and evaluation tools to support future work on ligand-conditioned protein generation.
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 30