Abstract: The canonical genetic code is degenerate, with most amino acids encoded by
multiple synonymous codons whose choice can influence translation, RNA stability,
and protein expression. Despite this complexity, the underlying rules linking codon
usage to molecular phenotypes remain poorly captured by existing models. Here,
we introduce the EnCodon model series within CodonFM, a family of large
foundation models trained on more than 130 million coding sequences spanning
over 22,000 species, designed to learn the contextual grammar of codon usage
directly from sequence. EnCodon models exhibit clear scaling behavior, with larger
models showing lower normalized confusion scores across synonymous codons,
revealing an emergent understanding of synonymous codon grammar. In zero-shot
settings, EnCodon achieves state-of-the-art performance across diverse
benchmarks, including prediction of de novo missense mutation pathogenicity,
clinical missense mutation classification, and ClinVar synonymous variant
discrimination. EnCodon generalizes to downstream mRNA design tasks, accurately
predicting translation efficiency and protein expression from sequence context.
Together, these results demonstrate that learning the intrinsic grammar of codon
usage is sufficient to infer a broad spectrum of biological and clinical effects,
establishing EnCodon as a scalable foundation for modeling translation and
RNA-driven gene regulation.