Keywords: Data Augmentation, RNA, Genomics, Genomic Language Models, Bioinformatics
TL;DR: A data augmentation tool to inject evolutionary information into Genomic Language Models
Abstract: Genomic Language Models (GLMs) suffer from inherent data scarcity, owing to the cost, time and complexity of wet-lab experiments. Data augmentation offers a solution; however, traditional methods often disrupt the structural and functional properties of biological sequences. Furthermore, current GLMs struggle to capture evolutionary dynamics through standard data pipelines, which limits their understanding of per-nucleotide importance and constraints.
To address this, we present PhyloAug, a structure-aware, evolution-inspired augmentation method grounded in neutral theory. PhyloAug leverages Genomic Foundation Models (GFMs) to accurately perturb RNA sequences, guided by phylogenetic analysis via PAML that identifies evolutionarily neutral sites, i.e. positions where mutations are unlikely to affect function. These sites are concatenated with RNA secondary structures, ensuring that augmentations respect native structural constraints while embedding signals of neutral evolution. We further validate our method through a direct comparison of predicted neutral sites with Rfam-annotated conserved regions.
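To make the idea of structure-respecting perturbation at neutral sites concrete, the following is a minimal sketch and not the PhyloAug implementation. It assumes that the neutral sites have already been identified upstream (for example from a PAML analysis) and are given as a set of indices, that the secondary structure is a dot-bracket string without pseudoknots, and it replaces the GFM-guided substitution described above with uniform random sampling. The names `pair_table`, `augment`, and `rate` are hypothetical.

```python
import random

# Canonical Watson-Crick pairs plus the G-U wobble pair.
VALID_PAIRS = {("A", "U"), ("U", "A"), ("G", "C"),
               ("C", "G"), ("G", "U"), ("U", "G")}


def pair_table(structure):
    """Map each paired index to its partner, given a dot-bracket string."""
    stack, partner = [], {}
    for i, ch in enumerate(structure):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            j = stack.pop()
            partner[i], partner[j] = j, i
    return partner


def augment(seq, structure, neutral_sites, rng, rate=0.5):
    """Mutate only the given neutral sites, keeping every base pair valid.

    Assumption: random substitution stands in for a GFM-guided choice.
    """
    partner = pair_table(structure)
    out = list(seq)
    for i in sorted(neutral_sites):
        if rng.random() >= rate:
            continue  # leave this neutral site unchanged
        choices = [b for b in "ACGU" if b != out[i]]
        if i in partner:
            # Paired position: only allow bases that still pair with the partner.
            j = partner[i]
            choices = [b for b in choices if (b, out[j]) in VALID_PAIRS]
        if choices:
            out[i] = rng.choice(choices)
    return "".join(out)


if __name__ == "__main__":
    seq = "GGGAAAUCCU"
    structure = "(((....)))"
    neutral = {0, 3, 4, 5}  # illustrative indices, not real PAML output
    print(augment(seq, structure, neutral, random.Random(0)))
```

In this sketch, positions outside `neutral_sites` are never touched, and a paired neutral site is only mutated to a base that still forms a valid pair with its partner, so the input dot-bracket structure remains compatible with the augmented sequence.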
We demonstrate that, by enriching training data with these evolution-guided augmentations, PhyloAug improves GFMs on well-established RNA benchmark tasks and further enables GFMs to internalise conserved sequence patterns and evolutionary constraints. We show this by establishing a novel task that requires evolutionary reasoning: conserved site detection. PhyloAug achieves significant performance improvements of up to 12.9% MCC and 17.2% F1-Score across our key tasks.
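For readers unfamiliar with the two reported metrics, the sketch below computes the Matthews correlation coefficient (MCC) and the F1-Score from binary labels using their standard definitions. It assumes, purely for illustration, that conserved site detection is scored per nucleotide with 1 meaning conserved and 0 meaning not conserved; the function names and the toy labels are hypothetical and are not taken from the paper's evaluation.

```python
import math


def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels encoded as 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def f1_score(y_true, y_pred):
    """F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when the denominator is 0."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def mcc(y_true, y_pred):
    """MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


if __name__ == "__main__":
    # Toy per-nucleotide labels: 1 = conserved, 0 = not conserved.
    y_true = [1, 1, 0, 0, 1, 0]
    y_pred = [1, 0, 0, 0, 1, 1]
    print(f1_score(y_true, y_pred), mcc(y_true, y_pred))
```

Unlike F1, MCC also accounts for true negatives, which makes it the more informative of the two when conserved and non-conserved sites are imbalanced.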
Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)
Submission Number: 21880