Keywords: Generative models, multi-modality, de novo generation, protein
TL;DR: A multi-modal generative model for protein co-design without tokenizers
Abstract: Proteins are fundamental to biological processes, with their function determined by the complex interplay between their amino acid sequence and three-dimensional structure. Developing generative models that capture this intrinsically multi-modal relationship is crucial for fields like drug discovery and protein engineering. Existing models often rely on a multi-stage training process: first, autoencoders are trained to tokenize the data into latent representations; second, a generative model is trained on those latent representations, i.e., generative modeling is performed in a latent space. We hypothesize that this multi-stage training process is not required to obtain performant co-design models and thus present SimpleDesign, an effective multi-modal protein design model trained directly in the raw data space. SimpleDesign uses a simple end-to-end training objective with two terms: a discrete cross-entropy loss for protein sequences and a continuous flow-matching regression objective for protein structures.
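As a rough illustration of this two-term objective, here is a minimal PyTorch sketch; the names `simpledesign_loss` and `struct_weight` and the linear-path velocity target are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

def flow_matching_target(x0, x1, t):
    """Linear conditional path x_t = (1 - t) * x0 + t * x1; under this
    path the regression target is the constant velocity x1 - x0.
    t should broadcast against x0/x1, e.g. shape (B, 1, 1)."""
    x_t = (1.0 - t) * x0 + t * x1
    return x_t, x1 - x0

def simpledesign_loss(seq_logits, seq_targets, pred_velocity, target_velocity,
                      struct_weight=1.0):
    """Two-term co-design objective (sketch): discrete cross-entropy on
    sequences plus continuous flow-matching regression on structures."""
    # seq_logits: (B, L, n_amino_acids); seq_targets: (B, L) integer labels.
    ce = F.cross_entropy(seq_logits.transpose(1, 2), seq_targets)
    # pred/target velocity: (B, L, 3), the structure's velocity field.
    fm = F.mse_loss(pred_velocity, target_velocity)
    return ce + struct_weight * fm
```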
To better model the sequence and structure modalities, we develop a Mixture-of-Transformers architecture that enables modality-specific processing while retaining global self-attention over both modalities.
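A minimal sketch of one such block, assuming per-modality feed-forward experts and layer norms with a single shared self-attention over the concatenated sequence and structure tokens; all names and dimensions are illustrative, not the paper's implementation.

```python
import torch
import torch.nn as nn

class MoTBlock(nn.Module):
    """One Mixture-of-Transformers-style block (illustrative sketch):
    modality-specific parameters, global self-attention over both modalities."""
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Separate feed-forward experts and norms per modality.
        self.ffn = nn.ModuleDict({
            m: nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                             nn.Linear(4 * d_model, d_model))
            for m in ("seq", "struct")
        })
        self.norm1 = nn.ModuleDict({m: nn.LayerNorm(d_model) for m in ("seq", "struct")})
        self.norm2 = nn.ModuleDict({m: nn.LayerNorm(d_model) for m in ("seq", "struct")})

    def forward(self, seq_tokens, struct_tokens):
        # Normalize each modality with its own parameters, then attend jointly
        # over the concatenation so every token sees both modalities.
        n_seq = seq_tokens.size(1)
        x = torch.cat([self.norm1["seq"](seq_tokens),
                       self.norm1["struct"](struct_tokens)], dim=1)
        attn_out, _ = self.attn(x, x, x)
        h = torch.cat([seq_tokens, struct_tokens], dim=1) + attn_out
        h_seq, h_struct = h[:, :n_seq], h[:, n_seq:]
        # Modality-specific feed-forward processing with residual connections.
        h_seq = h_seq + self.ffn["seq"](self.norm2["seq"](h_seq))
        h_struct = h_struct + self.ffn["struct"](self.norm2["struct"](h_struct))
        return h_seq, h_struct
```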
We train SimpleDesign on 1.8M sequence-structure pairs, achieving strong performance across co-design and unconditional sequence/structure generation benchmarks.
Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)
Submission Number: 9626