PhyloAug: Injecting Evolutionary information into GLMs via Data Augmentation

Jack Cole; Heng Yang; Ke Li

19 Sept 2025 (modified: 28 Nov 2025), ICLR 2026 Conference Withdrawn Submission, CC BY 4.0

Keywords: Data Augmentation, RNA, Genomics, Genomic Language Models, Bioinformatics

TL;DR: A data augmentation tool to inject evolutionary information into Genomic Language Models

Abstract: Genomic Language Models (GLMs) suffer from an inherent problem of data scarcity, due to the cost, time and complexity of wet-lab experiments. Data augmentation offers a solution; however, traditional methods often disrupt the structural and functional properties of biological sequences. Furthermore, current GLMs struggle to capture evolutionary dynamics through standard data pipelines, which limits their understanding of nucleotide-wise importance and constraints. To address this, we present PhyloAug, a structure-aware, evolution-inspired augmentation method grounded in neutral theory. PhyloAug leverages Genomic Foundation Models (GFMs) to accurately perturb RNA sequences, guided by phylogenetic analysis via PAML to identify evolutionarily neutral sites where mutations are unlikely to affect function. These sites are concatenated with RNA secondary structures, ensuring that augmentations respect native structural constraints while embedding signals of neutral evolution. We further validate our method through a direct comparison of predicted neutral sites with Rfam-annotated conserved regions. We show that by enriching training data with these evolution-guided augmentations, PhyloAug improves GFMs on well-established RNA benchmark tasks and further enables GFMs to internalise conserved sequence patterns and evolutionary constraints. We demonstrate the latter by establishing a novel task that requires evolutionary reasoning: conserved site detection. PhyloAug achieves significant performance improvements of up to 12.9% MCC and 17.2% F1-Score across our key tasks.
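
The core idea in the abstract, perturbing a sequence only at positions judged evolutionarily neutral, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function name `augment_at_neutral_sites` and its parameters are hypothetical, the neutral sites are supplied by hand rather than inferred with PAML, replacements are drawn uniformly at random rather than proposed by a GFM, and secondary structure is not modelled at all.

```python
import random

NUCLEOTIDES = "ACGU"


def augment_at_neutral_sites(sequence, neutral_sites, rate=0.5, seed=None):
    """Return a copy of `sequence` with substitutions at neutral sites only.

    Assumptions (not from the paper): `neutral_sites` is a hand-supplied
    list of 0-based indices, and each selected site is replaced by a
    different nucleotide chosen uniformly at random.
    """
    rng = random.Random(seed)
    bases = list(sequence)
    for i in neutral_sites:
        # Mutate each neutral site independently with probability `rate`.
        if rng.random() < rate:
            alternatives = [b for b in NUCLEOTIDES if b != bases[i]]
            bases[i] = rng.choice(alternatives)
    return "".join(bases)


if __name__ == "__main__":
    original = "GGGAAACCC"
    # Treat the three middle positions as neutral for illustration.
    print(augment_at_neutral_sites(original, [3, 4, 5], rate=1.0, seed=0))
```

Positions outside `neutral_sites` are never touched, so any constraint carried by the remaining positions is preserved by construction; a structure-aware variant would additionally need to handle base-paired positions, which this sketch leaves out.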

Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)

Submission Number: 21880
