Abstract: The canonical genetic code is degenerate, with most amino acids encoded by
multiple synonymous codons whose choice can influence translation, RNA stability,
and protein expression. Despite this complexity, the underlying rules linking codon
usage to molecular phenotypes remain poorly captured by existing models. Here,
we introduce the EnCodon model series within CodonFM, a family of large
foundation models trained on more than 130 million coding sequences spanning
over 22,000 species, designed to learn the contextual grammar of codon usage
directly from sequence. EnCodon models exhibit clear scaling behavior, with larger
models showing lower normalized confusion scores across synonymous codons,
revealing an emergent understanding of synonymous codon grammar. In zero-shot
settings, EnCodon achieves state-of-the-art performance across diverse
benchmarks, including prediction of de novo missense mutation pathogenicity,
clinical missense mutation classification, and ClinVar synonymous variant
discrimination. EnCodon generalizes to downstream mRNA design tasks, accurately
predicting translation efficiency and protein expression from sequence context.
Together, these results demonstrate that learning the intrinsic grammar of codon
usage is sufficient to infer a broad spectrum of biological and clinical effects,
establishing EnCodon as a scalable foundation for modeling translation and
RNA-driven gene regulation.