Keywords: pretraining task, genome language models
TL;DR: We introduce two pretraining tasks for genome language models that predict evolutionary rate from sequence; these tasks can, in some cases, outperform sequence-only training and, when combined with it, yield better-performing gLMs.
Abstract: Genome language models (gLMs) have the potential to encode how and when genes are regulated without requiring labeled data. Most gLMs are pretrained using genome sequence reconstruction tasks inspired by natural language processing, such as masked language modeling (MLM) or next token prediction (NTP). Recent studies have shown that these gLMs often fail to capture biological signal, showing limited gains over simple classifiers on raw sequence or over randomly initialized models on downstream genomic prediction tasks. To address these limitations, we explored alternative pretraining tasks for gLMs. Evolutionary rate has historically been the strongest predictor of function in genomics, but to date, there has been limited investigation of pretraining tasks that exploit evolution. Here, we introduce two evolution-based pretraining tasks that predict the rate of evolution from genomic sequence: current evolution prediction and masked evolution modeling. These tasks are designed so that they can be combined with NTP and MLM, enabling a systematic assessment of predicting sequence only, evolutionary rate only, or both. Using a novel suite of benchmarks that balance distinct aspects of genome function, we show that training on both sequence and evolutionary rate outperforms training on sequence alone. Moreover, for many tasks, training on evolutionary rate alone outperforms training on sequence alone. These results demonstrate that evolution-based pretraining offers a principled alternative or complement to sequence reconstruction, establishing evolution as a key training target for genome-scale models.
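To make the joint objective concrete, here is a minimal sketch of how a sequence-reconstruction loss (MLM) could be combined with a per-position evolutionary-rate regression loss. This is not the authors' implementation; all module names, head designs, and the additive loss weighting are illustrative assumptions.

```python
# Minimal sketch (illustrative, not the paper's code): a gLM trained jointly on
# masked nucleotide reconstruction and per-position evolutionary-rate prediction.
import torch
import torch.nn as nn

class JointGLM(nn.Module):
    def __init__(self, vocab_size=6, d_model=256, n_layers=4, n_heads=8):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.mlm_head = nn.Linear(d_model, vocab_size)  # reconstruct masked bases
        self.rate_head = nn.Linear(d_model, 1)          # predict evolutionary rate

    def forward(self, tokens):
        h = self.encoder(self.embed(tokens))
        return self.mlm_head(h), self.rate_head(h).squeeze(-1)

def joint_loss(model, tokens, masked_tokens, mask, rates, alpha=1.0):
    """Cross-entropy on masked positions plus MSE on evolutionary rate.

    tokens:        (B, L) original nucleotide token ids
    masked_tokens: (B, L) tokens with a subset replaced by a [MASK] id
    mask:          (B, L) boolean tensor marking masked positions
    rates:         (B, L) per-position evolutionary rates (e.g., conservation scores)
    alpha:         assumed weighting between the two objectives
    """
    logits, pred_rates = model(masked_tokens)
    mlm = nn.functional.cross_entropy(logits[mask], tokens[mask])
    evo = nn.functional.mse_loss(pred_rates, rates)
    return mlm + alpha * evo
```

Setting alpha to 0 recovers sequence-only (MLM) training, while dropping the MLM term and keeping only the rate regression corresponds to evolution-only pretraining; the abstract's comparison of "sequence only, evolutionary rate only, or both" maps onto these choices under this sketch's assumptions.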
Supplementary Material: zip
Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)
Submission Number: 18861