Abstract: Inspired by the success of unsupervised pre-training paradigms, researchers have applied these approaches to DNA pre-training. However, we argue that these approaches alone yield suboptimal results: DNA sequence by itself carries insufficient information, because its function is regulated by genomic profiles such as chromatin accessibility. Here, we demonstrate that supervised training for genomic profile prediction serves as a more effective alternative to pure sequence pre-training. Furthermore, considering the multi-species and multi-profile nature of genomic profile prediction, we introduce **S**pecies-**P**rofile **A**daptive **C**ollaborative **E**xperts (SPACE), which leverages a Mixture of Experts (MoE) to better capture the relationships between DNA sequences across different species and genomic profiles, thereby learning more effective DNA representations. Through extensive experiments across various tasks, our model achieves state-of-the-art performance, establishing that DNA models trained with supervised genomic profiles serve as powerful DNA representation learners.
Lay Summary: Existing methods for training DNA analysis models often rely on learning patterns from raw DNA sequences alone, much like memorizing letters without context. However, these approaches struggle because DNA’s true function depends on dynamic biological factors, such as how tightly packed the DNA is in a cell (chromatin accessibility), which aren’t captured by sequence data alone.
We propose a new strategy: instead of analyzing sequences in isolation, we train models to predict these critical biological factors directly. To handle the complexity of diverse species and multiple biological factors, we designed SPACE, a model that uses specialized “expert” modules. Each expert focuses on a specific species or biological feature, and the experts then work together to build a cohesive understanding of DNA.
SPACE outperforms existing methods in tasks like predicting gene activity and disease links, proving that integrating biological context into training produces more accurate DNA models. This breakthrough could accelerate research in genetics, medicine, and biotechnology by providing tools that better decode how DNA orchestrates life.
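To make the idea of collaborating expert modules more concrete, the following is a minimal sketch of a generic Mixture of Experts layer with species-conditioned routing. It is not the SPACE implementation (see the linked code for that). The class `ToyMoE`, the per-species gate bias `species_bias`, the `top_k` parameter, and the species names are all hypothetical choices made for illustration, and the weights are random rather than trained.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def linear(weights, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


class ToyMoE:
    """Illustrative MoE layer; NOT the architecture used by SPACE."""

    def __init__(self, dim, num_experts, species, top_k=2, seed=0):
        rng = random.Random(seed)

        def rand_matrix(rows, cols):
            return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

        self.top_k = top_k
        # Each expert is a single linear map (real experts are usually small MLPs).
        self.experts = [rand_matrix(dim, dim) for _ in range(num_experts)]
        # The gate scores every expert from the input features.
        self.gate = rand_matrix(num_experts, dim)
        # Assumption: species identity shifts the gate scores via an additive bias.
        self.species_bias = {
            s: [rng.gauss(0.0, 1.0) for _ in range(num_experts)] for s in species
        }

    def route(self, x, species):
        """Return {expert_index: weight} for the top-k experts."""
        scores = linear(self.gate, x)
        logits = [s + b for s, b in zip(scores, self.species_bias[species])]
        order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
        top = order[: self.top_k]
        # Renormalize over the selected experts only.
        weights = softmax([logits[i] for i in top])
        return dict(zip(top, weights))

    def forward(self, x, species):
        """Weighted sum of the selected experts' outputs."""
        routing = self.route(x, species)
        out = [0.0] * len(x)
        for idx, weight in routing.items():
            y = linear(self.experts[idx], x)
            out = [o + weight * yi for o, yi in zip(out, y)]
        return out, routing


moe = ToyMoE(dim=4, num_experts=4, species=["human", "mouse"], top_k=2)
features = [0.5, -1.0, 0.25, 2.0]  # stand-in for an encoded DNA window
out_human, routing_human = moe.forward(features, "human")
out_mouse, routing_mouse = moe.forward(features, "mouse")
print(routing_human)
print(routing_mouse)
```

The same input features can be routed to different experts depending on the species label, which is the general mechanism that lets a shared model specialize per species or per profile while still sharing parameters.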
Link To Code: https://github.com/ZhuJiwei111/SPACE
Primary Area: Applications->Health / Medicine
Keywords: DNA foundation models, mixture of experts, genomic profile prediction, DNA, biology
Submission Number: 4990