Abstract: The genetic code is highly redundant, with many synonymous codons encoding the same amino acid. Codon usage influences RNA structure, signaling, and translation rates. Differences in tRNA availability modulate elongation, with rare codons slowing translation and affecting co-translational folding and gene expression. Despite their functional importance and non-random distribution, rare codons are underrepresented in natural datasets, restricting the development of predictive models. We developed a transformer-based model that predicts codon sequences from amino acid sequences, substantially improving rare codon prediction. The model learns codon signatures encoding species identity, RNA thermodynamic properties, and elongation constraints without explicit labels. Attention analysis shows that codon choice depends on both short- and long-range sequence contexts, recovering dicodon effects and highlighting additional motifs. Finally, predictions correlate with experimental measurements of the impact of synonymous mutations on protein fitness, linking gene sequence to functional consequences and providing a framework to connect sequence variation, translation, and protein function.
External IDs: doi:10.64898/2026.03.28.714798