Prot2RNA: A Diffusion Language Model for Protein-Conditioned mRNA Coding Sequence Generation

20 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: mRNA design, Codon optimization, Diffusion language model
TL;DR: Prot2RNA is a diffusion language model that generates protein-conditioned mRNA coding sequences and learns codon preferences and other biological features of highly expressed transcripts.
Abstract: The redundancy of the genetic code, where multiple codons encode the same amino acid, creates a vast design space for messenger RNA (mRNA) sequences. Synonymous codon choices significantly affect mRNA stability, structure, translation efficiency, and immunogenicity, all critical for mRNA therapeutics and synthetic biology. We present Prot2RNA, a diffusion language model that generates mRNA coding sequences conditioned on a target protein. Prot2RNA uses a two-stage training approach: the model is first pretrained with masked diffusion modeling over separate sets of human protein and mRNA coding sequences, learning representations for both biological modalities in a shared space. It is then finetuned to generate codon sequences using target protein sequences as prompts. Prot2RNA was trained on human data and evaluated on a held-out set of highly expressed mRNA transcripts that are dissimilar in sequence to the training set. The results demonstrate that our diffusion-based codon optimization model outperforms existing methods in codon-level accuracy, in alignment with biologically meaningful properties, and in generating sequence profiles that closely mirror codon usage patterns in highly expressed wild-type human mRNAs. Unlike other deep learning models that primarily learn codon usage frequency, Prot2RNA implicitly learns biologically relevant codon preferences, providing a strong foundation for protein-aware mRNA design.
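The design space described in the abstract, and the protein-conditioned decoding step, can be illustrated with a minimal sketch. This is not the paper's model: the codon table below is a small subset of the standard genetic code, and the uniform sampling over synonymous codons stands in for the learned conditional distribution that Prot2RNA would supply at each denoising step. All function and variable names are hypothetical.

```python
import random

# Toy synonymous-codon table (subset of the standard genetic code).
# The actual model learns preferences over all 61 sense codons.
SYN_CODONS = {
    "M": ["ATG"],
    "K": ["AAA", "AAG"],
    "L": ["TTA", "TTG", "CTT", "CTC", "CTA", "CTG"],
}

def fill_masked_codons(protein, masked, rng):
    """One illustrative 'denoising' pass: fill each masked codon position
    by sampling from the synonymous codons of the corresponding amino
    acid. A masked diffusion model would instead sample from learned,
    context-dependent codon probabilities at each step."""
    out = []
    for aa, codon in zip(protein, masked):
        if codon == "___":  # masked (noised) position
            codon = rng.choice(SYN_CODONS[aa])
        out.append(codon)
    return out

protein = "MKL"                      # target protein sequence
masked = ["ATG", "___", "___"]       # partially masked coding sequence
cds = fill_masked_codons(protein, masked, random.Random(0))
```

Because every synonymous choice is valid at the protein level, the generated sequence always back-translates to the target protein; the model's job is to pick the codons that match the usage patterns of highly expressed transcripts.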
Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)
Submission Number: 23705